EPA/600/A-96/102
Smoothing of Breeding Bird Survey Data
To Produce National Biodiversity Estimates
Kwang-Su Yaagf, Daniel D. Carrf, Raymond J. O'Connor
fGeorge Mason University, Fairfax, VA 22030
^University of Maine, Orono, ME 04469
Abstract
The Breeding Bird Survey (BBS) is an annual sur-
vey conducted since 1966 at approximately 3000 25-mile
long routes within the continental U.S. The purpose of
the present analysis is to derive from the BBS a biodi-
versity measure based on incidence of 615 species. This
analysis purposely avoids use of concomitant informa-
tion concerning habitat because a later analysis will re-
late the biodiversity to habitat.
A preliminary analysis of abundance and prevalence
data focused attention on 342 species at 1870 routes in
the 1990 data. This smoothing process had two steps.
The first smoothing step addressed a problem in cen-
sus efficiency: occasionally observers could fail to de-
tect a species present on a route. The smoother used a
distance-limited median-based near-neighbor smoother
to impute bird occurrence at sites where birds were
not observed. To provide a basis for later compar-
ison against habitat indicators, the second smooth-
ing step,«.used near-neighbor logistic regression to esti-
mate probabilities on the U.S. Environmental Protec-
tion Agency's Environmental Monitoring and Assess-
ment Program (EMAP) hexagon grid. The biodiversity
measure was the number of species of birds exceeding
a specific threshold of probability at each EMAP grid
point.
Expert review of single year smoothing identified a sec-
ond issue. Species could be present on a BBS route in
some years but not in others. Review of long-running
BBS route suggested that a ten-year period provides a
stable estimate of a species' presence at a route. Almost
all species appearing at a route are detected sometime
during ten years. This motivated a new analysis based
on a ten-year window from 1981 through 1990. The re-
analysis used a logistic regression on 615 species of birds
at 2909 routes. The paper presents selected maps from
the first and second smoothing efforts. This paper pro-
vides some background on interpretation of results but
primarily emphasizes on computation details. Later pa-
pers will delve into interpretation and model criticism.
1 Introduction
A pressing problem in biodiversity studies is to deter-
mine how to devote scarce resources to obtain max-
imum protection of plant and animal species for the
least social and economic cost. Ideally this could be
determined using some form of prioritization algorithm
on a regular grid of spatial units for each of which a
measure of biodiversity is known (Kiester et al., 1993).
In practice, however, biodiversity, typically in the form
of species richness estimates, has been assessed only at
some smaller set of locations which may or may not co-
incide with regular grid points. Hence any prioritization
based on a regular grid needs some form of interpola-
tion from the available estimates.
One approach to comprehensive spatial interpolation
has been the development by the U.S. Fish and Wildlife
Service (USFWS) (now the National Biological Service
(NBS)) of the GAP program (Scott et al., 1990). This
uses information on wildlife-habitat relationships to im-
pute species presence or absence within vegetation poly-
gons determined using remote sensing data. Subsequent
overlaying of the set (or possibly a selected subset) of
species distribution maps shows concentrations of (com-
prehensive or selected) species richness over space.
One limitation of the GAP program approach is that it
uses species-vegetation relationships in establishing pat-
terns of species richness and thus precludes subsequent
analysis of potential environmental (vegetation) causes
of the biodiversity pattern established. We therefore
investigated how one might use purely spatial sample
information as to biological distributions in determin-
ing pattern of species richness across a regular grid. We
used data from the NBS Breeding Bird Survey (BBS),
conducted annually since 1966 (Bobbins et al., 1986), as
a prototype for the national comprehensive biodiversity
assessment envisaged by Kiester et al. (1993). In an ini-
tial analysis here we considered only 1990 bird survey
data, focusing on Kiester et al.'s idea that one might
be able to track *"""*1 changes in biodiversity in rela-
tion to annual remote sensing data as to environmental
change. In a second analysis we focus on the more ro-

-------
406
bust biodiversity estimates that come from multi-year
estimates of distribution.
Our approach used local logistic regression to estimate
values at 13922 EMAP grid points. Section 2 provides
the regression details. The details vary considerably
from the single year 1990 analysis and the ten-year
1981-1990 analysis. Section 3 provides a few details
about the graphical representation of the data. Section
4 describes conclusions and further work to be done.
2 Data and Smoothing Procedures
The original data consist of abundance measures
for species observed at approximately 3000 25-mile
long BBS routes. While direct consideration of
the abundance data can provide valuable insights,
this study focuses on biodiversity (For more de-
tails on BBS and species abundance maps see
http://www.im.nbs.gov/bbs). This analysis transforms
the counts into the binary measures of presence or ab-
sence. This measure of presence or absence of a species
is subject to two kinds of error, failure to observe species
that are present along the route and mis-identification
of species. The first problem is significant and moti-
vates the two different analyses in this section, a sin-
gle year analysis and a ten-year window analysis. The
second problem is disregarded for the present. Mis-
identification of a species is likely to be rare because
observer ability is routinely screened by BBS staff.
The initial effort to estimate values for the EMAP
grid points focused on the 1990 data. This data con-
sisted of 1870 routes and reports on 342 bird species.
This smoothing process involved an imputation step for
species not observed. The step did not provide satisfac-
tory results, which motivated a second analysis.
The second analysis considered a ten-year window from
1981-1990. This involved 2909 routes and 615 species.
Not all routes were monitored during all the years. In
contrast to the binary data for 1990, the ten-year win-
dow data for each species at each route was binomial in-
cidence data. The data considered the number of years
the species was observed out of the number of years the
route was monitored.
The ten-year window analysis was similar to the sin-
gle year analysis in that both used a 100 mile radius
to define the local neighborhoods in the smoothing pro-
cess. The choice reflected expert opinion about how far
species travel during the breeding season and represents
a compromise. Clearly some species travel further. Fur-
ther smoothing effort should consider species specific
neighborhoods. A study of variograms for data sets like
K. Yang, D. B. Can, and R. J. O'Connor
this may lead to further understanding of appropriate
neighborhood sizes.
The smoothing efforts involved preprocessing the data
to create local data sets within 100 miles of each estima-
tion point. The subsequent processing then addressed
imputation and smoothing of all species for each esti-
mation point.
2.1 Implementation Details for the Single
Year Analysis
The 1990 data did not fully reflect the presence of
species. We adjusted the data before smoothing. As
indicated previously, observations recorded as presence
are assumed to have low error rates. These observations
are not altered. The imputation algorithm changes ab-
sence to presence when the species was present at more
than half of the routes within 100 mile (If there were
more than 17 such routes, only the closest 17 were con-
sidered).
The smoothing stage used distance as covariate in the
local logistic regression (Hastie et al., 1991). We calcu-
lated distance using great arc distance. Special codes
flagged EMAP grid points that had no near neighbor
routes and fewer than 4 near neighbor routes all of
which are beyond 50 miles away. With a constant and
covariate, the model had two parameters. The iterative
fitting exploited the availability of a trivial explicit in-
verse and we easily coded the model in Fortran. For
this simple model, the running time was not a problem.
The results of the preliminary smoothing were not sat-
isfactory. The imputation step was inadequate. Routes
near the edge of the species range, for example, might
be colonized in 'good' years by juveniles unable to find
vacant breeding territories within the core range, yet be
left vacant in 'poor' years when all juveniles could find
territories vacant after the death of the original terri-
tory holder during the intervening winter (O'Connor,
1987). Investigation of species accumulation on BBS
routes surveyed annually over many years showed that
species richness estimates stablized after about 10 years
of censusing (R. J. O'Connor and M. T. Jones, in prepa-
ration).
The single covariate logistic regression model provided
limits capabilities to closely model the variation in pat-
terns of presence and absence with a 100 mile radius.
Consequently, the ten-year window analysis used a local
quadratic modeL

-------
Smoothing of Breeding Bird Survey Data to Produce
2.2 Implementation Details for the
Ten-Year Analysis
The ten-year analysis did not include an imputation
step. Rather, it proceeded directly to the local logistic
regression (McCullagh et al., 1983). The local quadratic
model required both x and y coordinates. Consistent
with the standard EMAP grid point, we chose to use
the Lambert azimuthal projection to obtain x and y.
Distances on this projection do not exactly match great
arc distances but the percentage error is negligible over
a radius of 100 miles.
We chose to downweight observations as a function of
distance. The weight used here is (1 — (d/sc)2)3 where d
is Euclidean distance in Lambert azimuthal coordinates
from the EMAP grid point to a near neighbor route and
ac is the furthest of the distance from a specific EMAP
grid point to near neighbor routes.
The use of quadratic model proved problematic. We
linked a standard subroutine library (but here un-
named) into the Fortran code. Then, we discovered
that contrary to the documentation the subroutine cor-
rectly handles only three covariates. We turned to Splus
which provided a logistic algorithm at the level of QR
decompositions.
We took a brute force approach and ran a logistic model
for each species at each of the 13922 EMAP grid points.
This produced 8562030 regressions. The regression took
many hours on many machines. With millions of regres-
sions problems of collinearity and model degeneration
occurred. We modified the Splus code to trap degen-
erate situations and ran simpler models in these cases.
If the quadratic model did not work, we tried a pla-
nar model, then a simple distance model and finally a
constant model. We smoothed data for all species at
each EMAP grid point to derive as much benefit as we
could from prior computations. This helped somewhat
in that the design matrix needed to be constructed only
once at each grid point.
Loess (Cleveland et al., 1988) provides a more strate-
gic approach to smoothing. It computes smooths at
selected points and fills in the rest of the surface using
blending functions. Unfortunately, the loess approach
was not an immediate option since it does not use a
logistic model.
3 Graphical Representation of Results
Figure 1 shows the smoothed probability estimate map
of red tailed hawk for the 1990 BBS data and Figure
2 shows the smoothed incidence estimate map of red
tailed hawk for the decade 1981-1990. In Figure 1, re-
National Biodiversity Estimates	407
gions in white indicate areas where the data were con-
sidered inadequate for smoothing. Figure 2 has much
less area in the high incidence (black) class than Figure
1. The imputation step fills in regions of high preva-
lence so that smoothed values are often 1. The real test
of the two methods occurs at the threshold for deciding
presence or absence. This threshold is currently set at
0.1. While there are three dark classes showing the re-
gions, separate plots showing only this distinction are
easier to compare.
Figure 3 shows the biodiversity map derived using a cut-
off estimate of probability of 0.1 for 342 species actually
recorded in the 1990 BBS data and Figure 4 shows the
biodiversity map derived using a cut-off incidence esti-
mate of 0.1 for 615 species for the decade 1981-1990.
Figure 4 is darker than Figure 3, reflecting higher bio-
diversity. The biodiversity measure for each hexagon is
the sum of species present after thresholding the prob-
ability or incidence at 0.1. Other thresholds can be
considered as well as other measures considered such as
sum of incidences across species. The threshold at 0.1
has a convenient interpretation: the species is observed
at least once during the ten year period. For grid cells
the values are smoothed. The statement is not true at
the centroid but holds at some place within 100 miles.
4 Conclusions and Further Work
This paper presents two approaches to dealing with the
spatial interpolation to a grid of irregular spaced biolog-
ical distribution data. We estimated the probabilities of
occurrence of bird species in each of the 13922 hexagons
on the national (conterminous) EMAP grid for each of
the 342 species recorded in the 1990 BBS survey. Ad-
ditionally, we estimated the incidence at grid points of
each of 615 species for these hexagons for the decade
1981-1990. We converted both probability and inci-
dence estimates to presence-absence estimates for each
species using 0.1 threshold. Our estimate of the species
richness summed the presence-absence values over the
species.
Our analyses suggest that it is technically feasible to
prepare systematic grid distribution maps from irreg-
ular biological data without recourse to environmental
covariate information and without violence to reason-
able biological constraints. We intend to validate our
estimates against independent data sets of three types.
First, BBS data are available from locations not sur-
veyed during the 1981-1990 decade. Second, EMAP
hexagon grid estimates of the presence-absence of each
bird species have been independently prepared for two
states, in each case without recourse to the BBS data for

-------
408
the state. Finally, a number of state atlases of breeding
birds have been conducted in recent years. Our most
deeply embedded assumption is the use of the 100-mile
smoothing parameter for the EMAP grid and this will
be re-visited, although initial semi-variogram analyses
of regional BBS distributions support the adequacy of
this distance. We also need to consider the impact of
alternative clipping strategies, especially in relation to
island locations.
Assuming successful validation of the present estimates,
our work allows investigation of several biological is-
sues previously limited by data constraints. One is
the pattern of spatially extensive correlations between
groups of species. Performed on annual data, these
may reflect covariation in temporal dynamics of differ-
ent groups, potentially revealing common environmen-
tal impacts or common demographic processes within
groups of species. Using decadal data, on the other
hand, will allow investigation of quasi-equilibrium pro-
cesses, such as shared evolutionary constraints. Finally,
the new species distribution grid makes it possible to in-
vestigate spatially extensive environmental correlates of
individual species distribution, notable candidates be-
ing the effects of spatial patterns of habitat distribution
and of climatic constraints on bird distribution. In addi-
. tion, the data permit the application of the biodiversity
prioritization analyses proposed by conservation biolo-
gists (e.g., Kiester et al., 1993).
Acknowledgements
We thank Malcolm T. Jones for assistance in prepar-
ing the Breeding Bird Survey data. Financial sup-
port from the U.S. Environmental Protection Agency
through a Cooperative Agreement (CR823806-01-0) to
the University of Maine and a Cooperative Agreement
(CR8280820-01-0) to George Mason University is grate-
fully acknowledged. This article has not been subject
to the review of the EPA and thus does not necessarily
reflect the view of the agency and no official endorse-
ment should be inferred.
References
Cleveland, W. S. and S. J. Devlin (1988) "Locally-
weighted Regression: An Approach to Regression Anal-
ysis by Local Fitting" Journal of American Statistical
Association, 83, 596-610.
K. Yang, D. B. Can, and R. J. O'Connor
Kiester, A. R., D. White, E. M. Preston, L. M. Mas-
ter, T. R. Loveland, D. F. Bradford, B. A. Csuti, R.
J. O'Connor, F. W. Davis, and D. M. Storms (1993)
Research Plan for Pilot Studies of the Biodiversity
Research Consortium. U.S. Environmental Protection
Agency.
McCullagh, P. and J. A. Nelder (1983) Generalized Lin-
ear Models Chapman & Hall, New York, NY.
O'Connor, R. J. (1987) "Organization of Avian As-
semblages - The Influence of Intraspecific Habitat Dy-
namics" In J. Gee and P. R. Giller, eds., Organisation
of communities - past and present Blackwell Scientific
Publications, Oxford (British Ecological Society Sym-
posium No. 26), 163-183.
Robbins, C. S., D. Bystrak, and P. H. Geissler (1986)
Breeding Bird Survey: Its First Fifteen Years Fish
Wildl. Serv. Resource Publ. 157, Washington D.C.
196 pp.
Scott, J. M., F. Davis, B. Csuti, B. Butterfield, R. Noss,
S. Caicco, H. Anderson, J. Ulliman, F. D'Erchia, and
C. Groves (1990) GAP Analysis: Protecting Biodiver-
sity using Geographic Information Systems Idaho Coop.
Fish. Wildl. Research Unit, University of Idaho, Idaho
176 pp.
Hastie T. J. and R. J. Tibshirani (1991) Generalized
Additive Models Chapman & Hall, New York, NY.

-------
* Smoothing of Breeding Bird Survey Data to Produce National Biodiversity Estimates
409
Red Tailed Hawk	Red Tailed Hawk
Figure 1. Smoothed probability estimate map of red	Figure 2. Smoothed incidence estimate map of red
tailed hawk for the 1990 BBS data.	tailed hawk for the decade 1981-1990.
Biodiversity Map	Biodiversity Map
Figure 3. Biodiversity map derived using a cut-off	Figure 4. Biodiversity map derived using a cut-off
estimate of probability of 0.1 for 342 species actually incidence estimate of 0.1 for 615 species for the decade
recorded in the 1990 BBS data.	1981-1990.

-------
NHEERL-C0R-2041A
TECHNICAL REPORT DATA
(Please read instructions on the reverse before comn
' REPCTO?y600/A-96/1 02
2.
3.
4. TITLE AND SUBTITLE
Smoothing of Breeding Bird Survey Data To Produce National Biodiversity
Etimates
6. REPORT DATE
6. PERFORMING ORGANIZATION CODE
7. AUTHORIS)
Yang, K-S.,and D.B.Carr(l) R.J.O'Connor(2)
8. PERFORMING ORGANIZATION REPORT NO.
9. PERFORMING ORGANIZATION NAME AND ADDRESS
(1) George Mason University (2)University of Maine
Fairfax, VA 22030 Orono, ME 04469
10. PROGRAM ELEMENT NO.
11. CONTRACT/GRANT NO.
12. SPONSORING AGENCY NAME AND ADDRESS
US EPA ENVIRONMENTAL RESEARCH LABORATORY
200 SW 35th Street
Corvallis, OR 97333
13. TYPE OF REPORT AND PERIOD COVERED
Symposium paper
14. SPONSORING AGENCY CODE
EPA/600/02
15. SUPPLEMENTARY NOTES
1995. Proceedings of the 27th Symposium on the Interface. Computer Science and Statistics, Statistics &
Manufacturing with Subthemes in Environmental Statistics, Graphics, and Imaging, Pittsburgh, Pa, June 21-24,
1995.
24, 1995.
16. ABSTRACT
/The Breeding Bird Survey (BBS) is an annual survey conducted since 1966 at approximately 3000 25-mile long
routes within the continental U.S. The purpose of the present analysis is to derive from the BBS a biodiversity
measure based on incidence of 61 5 species. This analysis purposely avoids use of concomitant information
concerning habitat because a later analysis will relate the biodiversity to habitat.
A preliminary analysis of abundance and prevalence data focused attention on 342 species at 1870 routes in the
1990 data.The first smoothing step addressed a problem in census efficiency: occasionally observers fail to detect
species present on route. The smoother used a distance-limited median-based near-neighbor smoother to impute bird
occurance where birds were not observed. To provide a basis for later comparison against habitat indicators,the
second smoothing step used near-neighbor logistic regression to estimate probabilities on the US Environmental
Protection Agency's Environmental Monitoring and Assessment Program hexigon grid. The biodivesity measure was
the number of species of birds exceeding a specific threshold of probability at each EMAP grid point.
Expert review of single year smoothing identified a second issue. Species could be present on a BBS route in some
years,but not others. Review of long-running BBS routes suggested a ten-year period provides a stable estimate of
species' presence at a route.Almost all species appearing at a route are detected sometime during ten years. This
motivated a new analysis based on a ten-year window from 1981-1990. The reanalysis used a logistic regression
on 615 species of birds at 1909 routes. The paper presents selected maps from the first and second smoothing
efforts. This paper provides some background on interpretation of results but primarily emphasizes computation
details. Later papers will delve into interpretation and model criticism.
17. KEY WORDS AND DOCUMENT ANALYSIS"-^
a. DESCRIPTORS
b. IDENTIFIERS/OPEN ENDED TERMS
c. COSATI,Field/Group
Breeding bird survey, biodiversity,
statistical graphics, environmental
statistics.


16. DISTRIBUTION STATEMENT
19. SECURITY CLASS [This Report)
21 NO. OF PAGES
5

-------