EPA/600/A-96/102 Smoothing of Breeding Bird Survey Data To Produce National Biodiversity Estimates Kwang-Su Yaagf, Daniel D. Carrf, Raymond J. O'Connor fGeorge Mason University, Fairfax, VA 22030 ^University of Maine, Orono, ME 04469 Abstract The Breeding Bird Survey (BBS) is an annual sur- vey conducted since 1966 at approximately 3000 25-mile long routes within the continental U.S. The purpose of the present analysis is to derive from the BBS a biodi- versity measure based on incidence of 615 species. This analysis purposely avoids use of concomitant informa- tion concerning habitat because a later analysis will re- late the biodiversity to habitat. A preliminary analysis of abundance and prevalence data focused attention on 342 species at 1870 routes in the 1990 data. This smoothing process had two steps. The first smoothing step addressed a problem in cen- sus efficiency: occasionally observers could fail to de- tect a species present on a route. The smoother used a distance-limited median-based near-neighbor smoother to impute bird occurrence at sites where birds were not observed. To provide a basis for later compar- ison against habitat indicators, the second smooth- ing step,«.used near-neighbor logistic regression to esti- mate probabilities on the U.S. Environmental Protec- tion Agency's Environmental Monitoring and Assess- ment Program (EMAP) hexagon grid. The biodiversity measure was the number of species of birds exceeding a specific threshold of probability at each EMAP grid point. Expert review of single year smoothing identified a sec- ond issue. Species could be present on a BBS route in some years but not in others. Review of long-running BBS route suggested that a ten-year period provides a stable estimate of a species' presence at a route. Almost all species appearing at a route are detected sometime during ten years. This motivated a new analysis based on a ten-year window from 1981 through 1990. The re- analysis used a logistic regression on 615 species of birds at 2909 routes. The paper presents selected maps from the first and second smoothing efforts. This paper pro- vides some background on interpretation of results but primarily emphasizes on computation details. Later pa- pers will delve into interpretation and model criticism. 1 Introduction A pressing problem in biodiversity studies is to deter- mine how to devote scarce resources to obtain max- imum protection of plant and animal species for the least social and economic cost. Ideally this could be determined using some form of prioritization algorithm on a regular grid of spatial units for each of which a measure of biodiversity is known (Kiester et al., 1993). In practice, however, biodiversity, typically in the form of species richness estimates, has been assessed only at some smaller set of locations which may or may not co- incide with regular grid points. Hence any prioritization based on a regular grid needs some form of interpola- tion from the available estimates. One approach to comprehensive spatial interpolation has been the development by the U.S. Fish and Wildlife Service (USFWS) (now the National Biological Service (NBS)) of the GAP program (Scott et al., 1990). This uses information on wildlife-habitat relationships to im- pute species presence or absence within vegetation poly- gons determined using remote sensing data. Subsequent overlaying of the set (or possibly a selected subset) of species distribution maps shows concentrations of (com- prehensive or selected) species richness over space. One limitation of the GAP program approach is that it uses species-vegetation relationships in establishing pat- terns of species richness and thus precludes subsequent analysis of potential environmental (vegetation) causes of the biodiversity pattern established. We therefore investigated how one might use purely spatial sample information as to biological distributions in determin- ing pattern of species richness across a regular grid. We used data from the NBS Breeding Bird Survey (BBS), conducted annually since 1966 (Bobbins et al., 1986), as a prototype for the national comprehensive biodiversity assessment envisaged by Kiester et al. (1993). In an ini- tial analysis here we considered only 1990 bird survey data, focusing on Kiester et al.'s idea that one might be able to track *"""*1 changes in biodiversity in rela- tion to annual remote sensing data as to environmental change. In a second analysis we focus on the more ro- ------- 406 bust biodiversity estimates that come from multi-year estimates of distribution. Our approach used local logistic regression to estimate values at 13922 EMAP grid points. Section 2 provides the regression details. The details vary considerably from the single year 1990 analysis and the ten-year 1981-1990 analysis. Section 3 provides a few details about the graphical representation of the data. Section 4 describes conclusions and further work to be done. 2 Data and Smoothing Procedures The original data consist of abundance measures for species observed at approximately 3000 25-mile long BBS routes. While direct consideration of the abundance data can provide valuable insights, this study focuses on biodiversity (For more de- tails on BBS and species abundance maps see http://www.im.nbs.gov/bbs). This analysis transforms the counts into the binary measures of presence or ab- sence. This measure of presence or absence of a species is subject to two kinds of error, failure to observe species that are present along the route and mis-identification of species. The first problem is significant and moti- vates the two different analyses in this section, a sin- gle year analysis and a ten-year window analysis. The second problem is disregarded for the present. Mis- identification of a species is likely to be rare because observer ability is routinely screened by BBS staff. The initial effort to estimate values for the EMAP grid points focused on the 1990 data. This data con- sisted of 1870 routes and reports on 342 bird species. This smoothing process involved an imputation step for species not observed. The step did not provide satisfac- tory results, which motivated a second analysis. The second analysis considered a ten-year window from 1981-1990. This involved 2909 routes and 615 species. Not all routes were monitored during all the years. In contrast to the binary data for 1990, the ten-year win- dow data for each species at each route was binomial in- cidence data. The data considered the number of years the species was observed out of the number of years the route was monitored. The ten-year window analysis was similar to the sin- gle year analysis in that both used a 100 mile radius to define the local neighborhoods in the smoothing pro- cess. The choice reflected expert opinion about how far species travel during the breeding season and represents a compromise. Clearly some species travel further. Fur- ther smoothing effort should consider species specific neighborhoods. A study of variograms for data sets like K. Yang, D. B. Can, and R. J. O'Connor this may lead to further understanding of appropriate neighborhood sizes. The smoothing efforts involved preprocessing the data to create local data sets within 100 miles of each estima- tion point. The subsequent processing then addressed imputation and smoothing of all species for each esti- mation point. 2.1 Implementation Details for the Single Year Analysis The 1990 data did not fully reflect the presence of species. We adjusted the data before smoothing. As indicated previously, observations recorded as presence are assumed to have low error rates. These observations are not altered. The imputation algorithm changes ab- sence to presence when the species was present at more than half of the routes within 100 mile (If there were more than 17 such routes, only the closest 17 were con- sidered). The smoothing stage used distance as covariate in the local logistic regression (Hastie et al., 1991). We calcu- lated distance using great arc distance. Special codes flagged EMAP grid points that had no near neighbor routes and fewer than 4 near neighbor routes all of which are beyond 50 miles away. With a constant and covariate, the model had two parameters. The iterative fitting exploited the availability of a trivial explicit in- verse and we easily coded the model in Fortran. For this simple model, the running time was not a problem. The results of the preliminary smoothing were not sat- isfactory. The imputation step was inadequate. Routes near the edge of the species range, for example, might be colonized in 'good' years by juveniles unable to find vacant breeding territories within the core range, yet be left vacant in 'poor' years when all juveniles could find territories vacant after the death of the original terri- tory holder during the intervening winter (O'Connor, 1987). Investigation of species accumulation on BBS routes surveyed annually over many years showed that species richness estimates stablized after about 10 years of censusing (R. J. O'Connor and M. T. Jones, in prepa- ration). The single covariate logistic regression model provided limits capabilities to closely model the variation in pat- terns of presence and absence with a 100 mile radius. Consequently, the ten-year window analysis used a local quadratic modeL ------- Smoothing of Breeding Bird Survey Data to Produce 2.2 Implementation Details for the Ten-Year Analysis The ten-year analysis did not include an imputation step. Rather, it proceeded directly to the local logistic regression (McCullagh et al., 1983). The local quadratic model required both x and y coordinates. Consistent with the standard EMAP grid point, we chose to use the Lambert azimuthal projection to obtain x and y. Distances on this projection do not exactly match great arc distances but the percentage error is negligible over a radius of 100 miles. We chose to downweight observations as a function of distance. The weight used here is (1 — (d/sc)2)3 where d is Euclidean distance in Lambert azimuthal coordinates from the EMAP grid point to a near neighbor route and ac is the furthest of the distance from a specific EMAP grid point to near neighbor routes. The use of quadratic model proved problematic. We linked a standard subroutine library (but here un- named) into the Fortran code. Then, we discovered that contrary to the documentation the subroutine cor- rectly handles only three covariates. We turned to Splus which provided a logistic algorithm at the level of QR decompositions. We took a brute force approach and ran a logistic model for each species at each of the 13922 EMAP grid points. This produced 8562030 regressions. The regression took many hours on many machines. With millions of regres- sions problems of collinearity and model degeneration occurred. We modified the Splus code to trap degen- erate situations and ran simpler models in these cases. If the quadratic model did not work, we tried a pla- nar model, then a simple distance model and finally a constant model. We smoothed data for all species at each EMAP grid point to derive as much benefit as we could from prior computations. This helped somewhat in that the design matrix needed to be constructed only once at each grid point. Loess (Cleveland et al., 1988) provides a more strate- gic approach to smoothing. It computes smooths at selected points and fills in the rest of the surface using blending functions. Unfortunately, the loess approach was not an immediate option since it does not use a logistic model. 3 Graphical Representation of Results Figure 1 shows the smoothed probability estimate map of red tailed hawk for the 1990 BBS data and Figure 2 shows the smoothed incidence estimate map of red tailed hawk for the decade 1981-1990. In Figure 1, re- National Biodiversity Estimates 407 gions in white indicate areas where the data were con- sidered inadequate for smoothing. Figure 2 has much less area in the high incidence (black) class than Figure 1. The imputation step fills in regions of high preva- lence so that smoothed values are often 1. The real test of the two methods occurs at the threshold for deciding presence or absence. This threshold is currently set at 0.1. While there are three dark classes showing the re- gions, separate plots showing only this distinction are easier to compare. Figure 3 shows the biodiversity map derived using a cut- off estimate of probability of 0.1 for 342 species actually recorded in the 1990 BBS data and Figure 4 shows the biodiversity map derived using a cut-off incidence esti- mate of 0.1 for 615 species for the decade 1981-1990. Figure 4 is darker than Figure 3, reflecting higher bio- diversity. The biodiversity measure for each hexagon is the sum of species present after thresholding the prob- ability or incidence at 0.1. Other thresholds can be considered as well as other measures considered such as sum of incidences across species. The threshold at 0.1 has a convenient interpretation: the species is observed at least once during the ten year period. For grid cells the values are smoothed. The statement is not true at the centroid but holds at some place within 100 miles. 4 Conclusions and Further Work This paper presents two approaches to dealing with the spatial interpolation to a grid of irregular spaced biolog- ical distribution data. We estimated the probabilities of occurrence of bird species in each of the 13922 hexagons on the national (conterminous) EMAP grid for each of the 342 species recorded in the 1990 BBS survey. Ad- ditionally, we estimated the incidence at grid points of each of 615 species for these hexagons for the decade 1981-1990. We converted both probability and inci- dence estimates to presence-absence estimates for each species using 0.1 threshold. Our estimate of the species richness summed the presence-absence values over the species. Our analyses suggest that it is technically feasible to prepare systematic grid distribution maps from irreg- ular biological data without recourse to environmental covariate information and without violence to reason- able biological constraints. We intend to validate our estimates against independent data sets of three types. First, BBS data are available from locations not sur- veyed during the 1981-1990 decade. Second, EMAP hexagon grid estimates of the presence-absence of each bird species have been independently prepared for two states, in each case without recourse to the BBS data for ------- 408 the state. Finally, a number of state atlases of breeding birds have been conducted in recent years. Our most deeply embedded assumption is the use of the 100-mile smoothing parameter for the EMAP grid and this will be re-visited, although initial semi-variogram analyses of regional BBS distributions support the adequacy of this distance. We also need to consider the impact of alternative clipping strategies, especially in relation to island locations. Assuming successful validation of the present estimates, our work allows investigation of several biological is- sues previously limited by data constraints. One is the pattern of spatially extensive correlations between groups of species. Performed on annual data, these may reflect covariation in temporal dynamics of differ- ent groups, potentially revealing common environmen- tal impacts or common demographic processes within groups of species. Using decadal data, on the other hand, will allow investigation of quasi-equilibrium pro- cesses, such as shared evolutionary constraints. Finally, the new species distribution grid makes it possible to in- vestigate spatially extensive environmental correlates of individual species distribution, notable candidates be- ing the effects of spatial patterns of habitat distribution and of climatic constraints on bird distribution. In addi- . tion, the data permit the application of the biodiversity prioritization analyses proposed by conservation biolo- gists (e.g., Kiester et al., 1993). Acknowledgements We thank Malcolm T. Jones for assistance in prepar- ing the Breeding Bird Survey data. Financial sup- port from the U.S. Environmental Protection Agency through a Cooperative Agreement (CR823806-01-0) to the University of Maine and a Cooperative Agreement (CR8280820-01-0) to George Mason University is grate- fully acknowledged. This article has not been subject to the review of the EPA and thus does not necessarily reflect the view of the agency and no official endorse- ment should be inferred. References Cleveland, W. S. and S. J. Devlin (1988) "Locally- weighted Regression: An Approach to Regression Anal- ysis by Local Fitting" Journal of American Statistical Association, 83, 596-610. K. Yang, D. B. Can, and R. J. O'Connor Kiester, A. R., D. White, E. M. Preston, L. M. Mas- ter, T. R. Loveland, D. F. Bradford, B. A. Csuti, R. J. O'Connor, F. W. Davis, and D. M. Storms (1993) Research Plan for Pilot Studies of the Biodiversity Research Consortium. U.S. Environmental Protection Agency. McCullagh, P. and J. A. Nelder (1983) Generalized Lin- ear Models Chapman & Hall, New York, NY. O'Connor, R. J. (1987) "Organization of Avian As- semblages - The Influence of Intraspecific Habitat Dy- namics" In J. Gee and P. R. Giller, eds., Organisation of communities - past and present Blackwell Scientific Publications, Oxford (British Ecological Society Sym- posium No. 26), 163-183. Robbins, C. S., D. Bystrak, and P. H. Geissler (1986) Breeding Bird Survey: Its First Fifteen Years Fish Wildl. Serv. Resource Publ. 157, Washington D.C. 196 pp. Scott, J. M., F. Davis, B. Csuti, B. Butterfield, R. Noss, S. Caicco, H. Anderson, J. Ulliman, F. D'Erchia, and C. Groves (1990) GAP Analysis: Protecting Biodiver- sity using Geographic Information Systems Idaho Coop. Fish. Wildl. Research Unit, University of Idaho, Idaho 176 pp. Hastie T. J. and R. J. Tibshirani (1991) Generalized Additive Models Chapman & Hall, New York, NY. ------- * Smoothing of Breeding Bird Survey Data to Produce National Biodiversity Estimates 409 Red Tailed Hawk Red Tailed Hawk Figure 1. Smoothed probability estimate map of red Figure 2. Smoothed incidence estimate map of red tailed hawk for the 1990 BBS data. tailed hawk for the decade 1981-1990. Biodiversity Map Biodiversity Map Figure 3. Biodiversity map derived using a cut-off Figure 4. Biodiversity map derived using a cut-off estimate of probability of 0.1 for 342 species actually incidence estimate of 0.1 for 615 species for the decade recorded in the 1990 BBS data. 1981-1990. ------- NHEERL-C0R-2041A TECHNICAL REPORT DATA (Please read instructions on the reverse before comn ' REPCTO?y600/A-96/1 02 2. 3. 4. TITLE AND SUBTITLE Smoothing of Breeding Bird Survey Data To Produce National Biodiversity Etimates 6. REPORT DATE 6. PERFORMING ORGANIZATION CODE 7. AUTHORIS) Yang, K-S.,and D.B.Carr(l) R.J.O'Connor(2) 8. PERFORMING ORGANIZATION REPORT NO. 9. PERFORMING ORGANIZATION NAME AND ADDRESS (1) George Mason University (2)University of Maine Fairfax, VA 22030 Orono, ME 04469 10. PROGRAM ELEMENT NO. 11. CONTRACT/GRANT NO. 12. SPONSORING AGENCY NAME AND ADDRESS US EPA ENVIRONMENTAL RESEARCH LABORATORY 200 SW 35th Street Corvallis, OR 97333 13. TYPE OF REPORT AND PERIOD COVERED Symposium paper 14. SPONSORING AGENCY CODE EPA/600/02 15. SUPPLEMENTARY NOTES 1995. Proceedings of the 27th Symposium on the Interface. Computer Science and Statistics, Statistics & Manufacturing with Subthemes in Environmental Statistics, Graphics, and Imaging, Pittsburgh, Pa, June 21-24, 1995. 24, 1995. 16. ABSTRACT /The Breeding Bird Survey (BBS) is an annual survey conducted since 1966 at approximately 3000 25-mile long routes within the continental U.S. The purpose of the present analysis is to derive from the BBS a biodiversity measure based on incidence of 61 5 species. This analysis purposely avoids use of concomitant information concerning habitat because a later analysis will relate the biodiversity to habitat. A preliminary analysis of abundance and prevalence data focused attention on 342 species at 1870 routes in the 1990 data.The first smoothing step addressed a problem in census efficiency: occasionally observers fail to detect species present on route. The smoother used a distance-limited median-based near-neighbor smoother to impute bird occurance where birds were not observed. To provide a basis for later comparison against habitat indicators,the second smoothing step used near-neighbor logistic regression to estimate probabilities on the US Environmental Protection Agency's Environmental Monitoring and Assessment Program hexigon grid. The biodivesity measure was the number of species of birds exceeding a specific threshold of probability at each EMAP grid point. Expert review of single year smoothing identified a second issue. Species could be present on a BBS route in some years,but not others. Review of long-running BBS routes suggested a ten-year period provides a stable estimate of species' presence at a route.Almost all species appearing at a route are detected sometime during ten years. This motivated a new analysis based on a ten-year window from 1981-1990. The reanalysis used a logistic regression on 615 species of birds at 1909 routes. The paper presents selected maps from the first and second smoothing efforts. This paper provides some background on interpretation of results but primarily emphasizes computation details. Later papers will delve into interpretation and model criticism. 17. KEY WORDS AND DOCUMENT ANALYSIS"-^ a. DESCRIPTORS b. IDENTIFIERS/OPEN ENDED TERMS c. COSATI,Field/Group Breeding bird survey, biodiversity, statistical graphics, environmental statistics. 16. DISTRIBUTION STATEMENT 19. SECURITY CLASS [This Report) 21 NO. OF PAGES 5 ------- |