OCTOBER 3, 2007
EPA 260-R-08-003
FINAL REPORT FOR THE PILOT STUDY OF
TARGETING ELEVATED BLOOD-LEAD LEVELS IN
CHILDREN
Prepared by
BATTELLE
Prepared for:
Margaret Conomos, Work Assignment Manager
Barry Nussbaum, Technical Adviser
Analytical Products Branch
Environmental Analysis Division
Sineta Wooten, Project Officer
Program Assessment and Outreach Branch
National Program Chemical Division
Office of Pollution Prevention and Toxics
U.S. Environmental Protection Agency
1200 Pennsylvania Avenue NW (7404T)
Washington D.C. 20460
BATTELLE DISCLAIMER
This report is a work prepared for the United States government by Battelle. In no event shall
either the United States government or Battelle have any responsibility or liability for any
consequences of any use, misuse, inability to use, or reliance upon the information contained
herein, nor does either warrant or otherwise represent in any way the accuracy, adequacy,
efficacy, or applicability of the contents hereof.
ACKNOWLEDGEMENTS
The EPA and the authors thank the organizations whose contributions made this report possible
including the Lead Poisoning Prevention Branch at the Centers for Disease Control and
Prevention, the Childhood Lead Poisoning Prevention Program at the Massachusetts
Department of Public Health, and the Office of Healthy Homes and Lead Hazard Control at the
U.S. Department of Housing and Urban Development.
This report was based on work conducted by Battelle, with significant contributions from
Warren Strauss, Tim Pivetz, Elizabeth Slone, Jyothi Nagaraja, Nicole Iroz-Elardo, Rona Boehm,
Michael Schlatt, Darlene Wells, Jennifer Zewatsky, Michele Morara, and Bruce Buxton.
TABLE OF CONTENTS
EXECUTIVE SUMMARY
1.0 INTRODUCTION
1.1 Background and Purpose of Study
1.2 Study Objectives
1.2.1 Objective 1 - Combine and Manage Multiple Data Sources
1.2.2 Objective 2 - Conduct Analyses to Identify Predictive Variables and
Model Children's Blood-Lead Levels
1.2.3 Objective 3 - Develop Visualization Tool to Graphically Model Predicted
Blood-Lead Levels
2.0 STUDY METHODOLOGY
2.1 General Approach
2.2 Data Management
2.3 Descriptive Data Analyses
2.4 Development of Multivariate Statistical Models
2.4.1 Statistical Models for the Broad Coverage - Low-Resolution Model
2.4.2 Statistical Models for the High-Resolution Model within Massachusetts
3.0 DATA SOURCES AND DATABASE DEVELOPMENT
3.1 Children's Blood-Lead Measurements
3.2 Demographic Data
3.3 Environmental Data
3.3.1 Concentrations of Lead in Air
3.3.2 Toxics Release Inventory Data
3.3.3 Water Quality Data
3.4 Programmatic Data
3.4.1 Programmatic Funding Variables
3.4.2 EPA Region
3.4.3 Housing Inspection Data (Massachusetts)
3.5 Data Linkages
4.0 EXPLORATORY DATA ANALYSES
4.1 Relationship between National Blood-Lead Data and Explanatory Variables
4.2 Relationship between Local Blood-Lead Data and Explanatory Variables
5.0 STATISTICAL MODELING RESULTS
5.1 Low-Resolution Modeling Results
5.2 High-Resolution Modeling Results
6.0 GRAPHICAL PRESENTATION OF MODELING RESULTS
6.1 Maps of Observed and Predicted Blood-Lead Outcomes
6.2 Visualization Tool Development
7.0 DISCUSSION AND FUTURE WORK
7.1 Major Findings
7.2 Comparison of National Results and NHANES
7.3 Data Issues
7.3.1 Biases from Geocoding
7.3.2 Reporting Limits in Surveillance Data
7.3.3 Selection Bias in Surveillance Data
7.3.4 Limitations of Ecological Models for Predicting Within-Area Relationships
7.3.5 Use of 2000 Census Data and Other Time Invariant Data as Predictors
7.4 Model Validation Issues
7.5 Other Recommendations for Immediate Future Work
8.0 REFERENCES
Appendix A Exploratory Analysis Summary Pages
Appendix B Massachusetts Data: Exploratory Analysis Summary Pages
Appendix C Detailed Exploratory Analyses of 95th and 99th Percentile Variables
in National Models
Appendix D Detailed Discussion of National Exploratory Analyses
Appendix E Detailed Discussion of Massachusetts Exploratory Analyses
Appendix F U.S. Counties and Massachusetts Census Tracts with Highest Predicted BLLs
Appendix G Detailed Maps of National and State Model Outputs
Appendix H Data Dictionaries for National and Massachusetts Databases
LIST OF TABLES
Table 3-1 Initial Variables for Analysis Created From the 2000 Census
Table 4-1 Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for Pr(PbB > 5
ug/dL) Models, National Data
Table 4-2 Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for Pr(PbB > 10
ug/dL) Models, National Data
Table 4-3 Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for Pr(PbB > 15
ug/dL) Models, National Data
Table 4-4 Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for Pr(PbB > 25
ug/dL) Models, National Data
Table 4-5 Summary of Log-likelihood Ratios from each Model Fit to all Potential Explanatory
Variables, Massachusetts Data
Table 5-1 Summary of Variables Included in Final National Multivariate Model
Table 5-2 Model 1 (Proportion > 5 ug/dL) Parameter Estimates for Multivariate National Model
Table 5-3 Model 2 (Proportion > 10 ug/dL) Parameter Estimates for Multivariate National
Model
Table 5-4 Model 3 (Proportion > 15 ug/dL) Parameter Estimates for Multivariate
National Model
Table 5-5 Model 4 (Proportion > 25 ug/dL) Parameter Estimates for Multivariate
National Model
Table 5-6 Summary of Variables Included in Final Massachusetts Multivariate Model
Table 5-7 Massachusetts Multivariate Model Estimates
LIST OF FIGURES
Figure 5-1 Histograms of Residuals from Fitted National Multivariate Model 1
Figure 5-2a Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with BLL > 5 ug/dL
Figure 5-2b Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 5 ug/dL (Logit Scale)
Figure 5-3 Histograms of Residuals from Fitted National Multivariate Model 2
Figure 5-4a Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 10 ug/dL
Figure 5-4b Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 10 ug/dL (Logit Scale)
Figure 5-5 Histograms of Residuals from Fitted National Multivariate Model 3
Figure 5-6a Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 15 ug/dL
Figure 5-6b Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 15 ug/dL (Logit Scale)
Figure 5-7 Histograms of Residuals from Fitted National Multivariate Model 4
Figure 5-8a Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 25 ug/dL
Figure 5-8b Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with
BLL > 25 ug/dL (Logit Scale)
Figure 5-9 Histograms of Residuals from Fitted Massachusetts Multivariate Model 1
Figure 5-10 Plots for Predicted versus Observed Values with 45° line from Fitted Massachusetts
Multivariate Model 1
Figure 5-11 Histograms of Residuals from Fitted Massachusetts Multivariate Model 2
Figure 5-12 Plots for Predicted versus Observed Values with 45° line from Fitted Massachusetts
Multivariate Model 2
Figure 5-13 Histograms of Residuals from Fitted Massachusetts Multivariate Model 3
Figure 5-14a Plot of Massachusetts Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children with
BLL > 5 ug/dL
Figure 5-14b Plot of Massachusetts Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children with
BLL > 5 ug/dL (Logit Scale)
Figure 5-15 Histograms of Residuals from Fitted Massachusetts Multivariate Model 4
Figure 5-16a Plot of Massachusetts Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children with
BLL > 10 ug/dL
Figure 5-16b Plot of Massachusetts Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children with
BLL > 10 ug/dL (Logit Scale)
Figure 5-17 Histograms of Residuals from Fitted Massachusetts Multivariate Model 5
Figure 5-18a Plot of Massachusetts Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children with
BLL > 15 ug/dL
Figure 5-18b Plot of Massachusetts Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children with
BLL > 15 ug/dL (Logit Scale)
Figure 6-1 Observed and Predicted Proportion of Children with Blood-Lead Levels > 10 ug/dL in the
U.S. by County, 2000 and 2005
Figure 6-2 Observed and Predicted Proportion of Children with Blood-Lead Levels > 10 ug/dL
in Region V by County, 2000 and 2005
Figure 6-3 Observed and Predicted Proportion of Children with Blood-Lead Levels > 10 ug/dL
in Massachusetts by Census Tract, 2000 and 2005
Figure 6-4 Response Surface of Predicted Geometric Mean Blood-Lead Concentration Across
the State of Illinois from the Visualization Tool
Figure 6-5 Time Series Plot of Observed and Predicted Geometric Mean Blood-Lead
Concentration in Cook County Illinois from the Visualization Tool
Figure 7-1 Comparison of National Surveillance Data to NHANES Data
EXECUTIVE SUMMARY
This pilot study seeks to develop statistical models to predict risk of childhood lead poisoning
within specified geographic areas based on a combination of demographic, environmental, and
programmatic information sources. Exposure factors associated with childhood lead poisoning
were investigated within census tracts for a community-focused set of models in Massachusetts,
as well as within counties across the United States in a series of national models. Aggregated
summary measures of the proportion of children screened at or above 5, 10, 15 and 25 ug/dL
within defined geographic areas (census tracts and counties) were used as the response variable
in the statistical models. These summary measures were constructed at 3-month (quarterly)
intervals from 1995 through 2005, in counties across the nation using data from CDC's National
Surveillance Database, and from 2000-2006 in census tracts within the Commonwealth of
Massachusetts based on data provided by the Massachusetts Department of Public Health.
The results of this study suggest that longitudinal predictive models can be developed at the
county level across the nation, based on the use of quarterly summary information from CDC's
National Surveillance Database, and at the census-tract level within states that have a long
history of universal screening and reporting, such as Massachusetts. These models can be used
to describe how risk of childhood lead poisoning changes over time within different regions of
the country, as well as within small geographic areas within states (e.g., counties) and even
smaller geographic areas within counties (e.g., census tracts). They can be used to predict the
risk of childhood lead poisoning in counties (or census tracts) with little or no surveillance data,
and also can be used to identify those counties (or census tracts) that are at highest risk at the end
of the period of observation.
The statistical model chosen (a random effects model with separate intercepts and slopes
estimated within each county or census tract) also allows ranking of geographic areas based on
the rate of decline over time after accounting for the fixed-effects variables of the model
(although only among those areas that provided adequate surveillance data). These random-
effects models were fit to the exceedance proportions within the context of a logistic regression
model. Within the context of the Broad-Based National Model, these random effects allow EPA
to identify those counties that are experiencing a more rapid reduction in risk of childhood lead
poisoning over time (to identify best practices) and those counties that are experiencing a
significantly less rapid decline over time (to identify areas in need of additional attention and
resources for combating lead poisoning), after already accounting for the demographic,
programmatic, and environmental factors included in the multivariate model.
Within the series of national models at the county level of geographic specificity, the data
suggest that there are significant differences in the distribution of childhood blood-lead
concentrations among the different regions of the country, and that the manner in which these
distributions change over time and are impacted by seasonality also is regionally specific. The
risk of childhood lead poisoning had a statistically significant downward trend over time in all
areas of the country.
After accounting for these regional differences, a number of demographic, environmental, and
programmatic variables were found to be highly predictive of childhood blood-lead
concentrations among the different response variables modeled within this project. The specific
variables that were found to be predictive within the multivariate models varied based on the
response variable; however, some variables were found to be predictive
in multiple models. In addition to various census demographic variables that were identified in
previous risk modeling efforts (e.g., age of housing, percent single parent families,
race/ethnicity), air modeling data, variables constructed from EPA's Safe Drinking Water
Information System, and programmatic funding information from HUD and CDC were found to
be highly predictive in the multivariate models.
Within the context of the high-resolution model developed using data from the Commonwealth
of Massachusetts, a highly significant downward trend in the risk of childhood lead poisoning
also was identified among the five models developed. Because very few children were
observed at or above 25 ug/dL within Massachusetts over the 2000-2006 period of observation,
we were unable to fit a sixth model for that threshold. After accounting for the long-term reduction over time
and seasonality using similar methods that were employed in the Broad-Based National Model,
only the demographic and programmatic variables were included in the multivariate models for
risk of childhood lead poisoning at the census-tract level. Of particular interest were the
variables that described the proportion of housing units within each census tract that were found
to be in compliance and out of compliance with the Massachusetts Standard of Care. In all five
of the multivariate models, the risk of childhood lead poisoning was significantly reduced as the
proportion of housing units in compliance increased within a census tract. In addition, for the
last two models (which predicted proportion of children at or above 10 and 15 ug/dL), the risk of
childhood lead poisoning increased significantly as the proportion of housing units out of
compliance increased within a census tract.
The observed and predicted values from the multivariate models (including predicted values
where there were no observed surveillance data) were used to generate static maps using
ArcView software, and were loaded into a customized dynamic visualization tool that allows users
to interact with the modeling results to assess how risk of childhood lead poisoning changes over
time within specific regions of the country. This tool will help EPA and others identify areas
that remain at risk for childhood lead poisoning as we approach the 2010 goal of elimination of
this preventable adverse health outcome.
1.0 INTRODUCTION
1.1 Background and Purpose of Study
Over the past 15 years, various childhood lead poisoning prevention programs (CLPPPs)
throughout the United States have conducted analyses of their screening data to develop "risk
indices," or mathematical models for predicting the prevalence of childhood lead poisoning in
different geographic areas within their regions of concern. These modeling efforts generally are
intended to characterize the extent of the prevalence of childhood lead poisoning within their
geographic areas and to support the development of targeted screening and outreach plans in
order to reach the 2010 goal of eliminating childhood lead poisoning throughout the United
States.
To date, the majority of modeling efforts have focused on combining blood-lead testing
information and demographic data available from the U.S. Census. Previous studies have
combined childhood surveillance data (aggregated at the zip-code or census-tract level) with
demographic predictor variables from the Census Bureau for the purposes of targeting
geographic areas at higher risk of childhood lead poisoning (Miranda, Dolinoy, and Overstreet
2002; Miranda et al. 2005; Strauss et al. 2001a). These studies have led to recommendations for
using age of housing and percent of population below the poverty line for targeting
neighborhoods that may be of increased risk for childhood lead poisoning (CDC 1997).
Numerous studies also have been used to document the relationship between children's blood-
lead concentrations and measures of lead in residential environmental media (dust, soil, air,
water, and food) (HUD 1995; Lanphear et al. 1998; Strauss et al. 2001b). These studies have
contributed to EPA and HUD regulations and policies for identifying and reducing residential
childhood lead exposures (24 CFR Part 35; 40 CFR Part 745; U.S. Department
of Housing and Urban Development September 15, 1999). Other studies have combined blood-
lead surveillance data with programmatic information on housing units treated to determine the
positive impact of housing-based intervention programs (Strauss et al. 2006).
The goal of this study is to explore models based on a hierarchical combination of demographic,
environmental, and programmatic information sources in order to predict the number of children
at risk of elevated blood-lead levels for a given geographic area. While the models are highly
dependent on available data, this study provides a statistical methodology that combines each
data source in an appropriate manner, adjusting for global and local trends over time. In doing
so, the models build upon concepts of hierarchical modeling and longitudinal data analysis.
As EPA, CDC, and other federal and state agencies prepare to meet the 2010 goal of eliminating
childhood lead poisoning, this pilot study integrates several different types of data sources
with the aim of improving on the predictive power of models that rely on a single information source. This
should allow for more efficient targeting of those geographic areas that need the most help in
eliminating childhood lead poisoning.
1.2 Study Objectives
1.2.1 Objective 1 - Combine and Manage Multiple Data Sources
The first objective of the study was to combine multiple sources of information in order to assess
the impacts of various factors on children's blood-lead levels. The study had to obtain and
manage data relating to blood-lead levels, environmental exposure, demographic characteristics,
and programmatic support to state and local childhood lead-poisoning prevention efforts.
Missing, incomplete, or error-prone data were identified for each data source and steps were
taken to resolve data problems. Databases were developed to store and later combine each data
source in a manner that supported the development of predictive models. Master databases that
integrated multiple data sources were developed to enable efficient access to data required for
statistical analyses. A data dictionary was prepared to document the various study databases.
1.2.2 Objective 2 - Conduct Analyses to Identify Predictive Variables and Model
Children's Blood-Lead Levels
The second study objective was to conduct statistical analyses in order to develop models that
are predictive of risk of childhood lead poisoning within defined geographic areas as a function
of various environmental, programmatic, and demographic factors. As part of this
objective, a National model was developed for predicting risk at the county level based on
surveillance data from the U.S. Centers for Disease Control and Prevention (CDC), and a local
model was developed at the census-tract level using blood-lead surveillance data from within the
Commonwealth of Massachusetts. As part of the model building process at both the national and
local levels, the various data sources underwent exploratory analyses to investigate data
distributions, identify relationships between variables, and determine appropriate variables to
include in subsequent statistical models. Part of the exploratory analyses included an effort to
identify which environmental, programmatic, and demographic factors were most predictive of
risk of childhood lead poisoning. Multivariate statistical models then were developed using
appropriate statistical software to combine the various data sources within a single model that
accounted for trends in risk of childhood lead poisoning over time within defined geographic
areas. Model diagnostics were reviewed, and models with the best fit were identified.
1.2.3 Objective 3 - Develop Visualization Tool to Graphically Model Predicted Blood-
Lead Levels
The third study objective was to develop an appropriate visualization tool that allows users to
interact with the results of the statistical model predicting children's blood-lead levels across the
United States. This tool provides the user with the flexibility to visually compare the predicted
blood-lead levels across areas of the country and also to drill down into individual counties or
census tracts to assess the input data that generated the predicted value.
2.0 STUDY METHODOLOGY
2.1 General Approach
This pilot study sought to develop models to predict the number of children at risk of elevated
blood-lead levels for a given geographic area based on a hierarchical combination of
demographic, environmental, and programmatic information sources. Doing so required looking
at both the mechanisms of childhood lead risk assessment and control activities at the local level
as well as at broad trends across the United States. The two main analysis goals correspond to
developing predictive models at two different levels of geographic specificity, and appear as
follows:
1. Broad Coverage (Low-Resolution) Model: This type of model is intended to be able
to characterize broad trends over time in the prevalence of childhood lead poisoning
at the county level across the entire United States. This model was based on quarterly
county-level aggregated surveillance data from the CDC and augmented with
environmental data from a variety of sources, demographic data from the U.S.
Census, and programmatic (level of federal funding) information.
2. High-Resolution Model: This type of model represents the effort to assess the
relative contribution of various exposure sources associated with elevated blood-lead
concentrations within select communities. This type of model reflects the
idea that exposures that contribute to childhood lead poisoning are likely to be
community specific. This analysis goal was met through modeling census-tract level
surveillance data within Massachusetts as well as housing unit lead assessment and/or
control activities. These data sources were augmented with all of the environmental,
demographic, and programmatic information used in the national model with the
addition of state programmatic funding levels.
The primary objective of this pilot study was to utilize combined information from different
sources at various levels of geographic and temporal specificity to more accurately target
geographic areas at high risk for not meeting the 2010 goal of eliminating childhood
lead poisoning. As such, the study required careful integration of a variety of data sources with
various characteristics and documentation. Data to support this study were gathered from
multiple sources, including federal, state and local lead poisoning prevention programs, as well
as publicly available data that were downloaded from the internet (e.g., census data, EPA's
Toxics Release Inventory).
2.2 Data Management
When each data source was received, the data and supporting documentation were reviewed to
gain knowledge on the structure, relationship, and quality of the data. Database managers
worked with the project team to determine the final format for each database, the desired uses of
the databases, as well as the requirements for maintaining the databases. Based on this
information, master databases were constructed in SQL server for both the national low-
resolution model and for the high-resolution model based on Massachusetts data that integrated
the various environmental, demographic, and programmatic variables, and facilitated statistical
analyses of the combined data. These datasets were translated directly to SAS datasets for
statistical analysis, and also were transferred to Microsoft Access for delivery to EPA. The
Microsoft Access database includes a compact version of each database utilized in the statistical
analysis, with any extraneous variables removed. In addition, the Microsoft Access database
includes a copy of the integrated longitudinal dataset used to support the final multivariate
models developed within this project.
Throughout the development process, checks for completeness were conducted on all study
databases, and the project team worked with data-sharing collaborators and EPA to attempt to
complete missing data as necessary to complete the proposed statistical analyses. Any changes
to the databases (corrections, additions, deletions, etc.) were documented in appropriate meta-
data files, and reported to EPA within the data dictionary attached to this report as Appendix H.
As part of constructing and maintaining these databases, the project team developed
appropriate documentation of the combined master databases.
Standard Operating Procedures (SOPs) were followed to ensure the proper storage, backup, and
retrieval of datasets created and analyzed for this study. Additional details of these SOPs can be
found in the Quality Management Plan prepared for this project (Battelle 2007).
2.3 Descriptive Data Analyses
The analysis began with an assessment of the study sample, i.e., the proportion of counties and
census tracts in the sample with complete data for both the response variable and the explanatory
variables. Prior to the fitting of any descriptive statistics to assess the predictive ability of any of
the explanatory variables, the blood-lead response variables needed to be constructed based on
the CDC and Massachusetts blood-lead surveillance data. These data sources contain
information on individual blood-lead testing results on children, and were aggregated into
quarterly summary statistics (number of children observed, arithmetic and/or geometric mean1,
and number of children observed at or above 5, 10, 15, and 25 ug/dL) at the county level (for the
CDC data) and the census-tract level (for the Massachusetts data). An executable was developed
to extract these quarterly summary statistics from each county from CDC's SQL server database
for children aged 6-36 months, and a similar executable was deployed to create parallel summary
statistics at the census-tract level for the Massachusetts surveillance data. Because of
confidentiality restrictions, county/quarter (or census tract/quarter) combinations with fewer than
5 observations were automatically eliminated from the dataset. Data reported prior to 1995 also
were eliminated from the analysis database prior to statistical analysis.
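The aggregation rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the executable used in the study: the record layout `(area_id, test_date, bll)`, the function names, and the output field names are assumptions, and the records are assumed to be already restricted to children aged 6-36 months. Only the thresholds, the minimum cell size of 5, and the 1995 cutoff come from the text.

```python
from collections import defaultdict
from datetime import date

THRESHOLDS = (5, 10, 15, 25)  # ug/dL cut points named in the text
MIN_CELL_SIZE = 5             # cells with fewer observations are dropped
FIRST_YEAR = 1995             # data reported before 1995 are dropped


def quarter_of(d):
    """Return (year, quarter) for a date, with quarter in 1..4."""
    return d.year, (d.month - 1) // 3 + 1


def summarize(records):
    """Aggregate (area_id, test_date, bll) records into area/quarter cells.

    Hypothetical record layout; area_id stands for a county or census tract.
    """
    cells = defaultdict(list)
    for area_id, test_date, bll in records:
        if test_date.year < FIRST_YEAR:
            continue  # outside the analysis period
        cells[(area_id, *quarter_of(test_date))].append(bll)

    summaries = {}
    for key, values in cells.items():
        n = len(values)
        if n < MIN_CELL_SIZE:
            continue  # confidentiality restriction: fewer than 5 observations
        summary = {"n": n, "mean": sum(values) / n}
        for t in THRESHOLDS:
            # count of children observed at or above each threshold
            summary[f"n_ge_{t}"] = sum(v >= t for v in values)
        summaries[key] = summary
    return summaries
```

Each surviving key is an `(area_id, year, quarter)` combination; dividing `n_ge_10` by `n`, for example, gives the exceedance proportion for that cell.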
Once the aggregated summary datasets were constructed, they were reviewed for possible
problems associated with childhood lead poisoning prevention programs not following universal
reporting protocols (for some localities, data were only transmitted to the CDC National
Surveillance Database for children with elevated blood-lead concentrations over certain periods
of time). A screening algorithm was developed to remove these suspect data from the analysis
dataset - resulting in the elimination of less than 3 percent of the aggregate summary records
from the National database. The screening algorithm also was applied to the Massachusetts data
- however no records were eliminated, as Massachusetts was following universal screening and
reporting guidelines over the 2000-2006 time period for which they provided data. Additional
detail on the manner in which the blood-lead response variables were constructed can be found
in Section 3.1.
1 The CDC reported only the arithmetic mean, while Massachusetts reported both arithmetic and geometric means.
In preparation for developing longitudinal statistical models, univariate summaries of each
variable as a function of time were generated and comparisons were made of these distributions
using side-by-side box-plots for continuous data or bar-charts for categorical data. This helped
verify that the data were clean and ready for analysis and identified cells with sparse data. Such
descriptive analyses were conducted on each database, to characterize the distributions of all
observed variables using frequency distributions for categorical variables, and simple summary
statistics (mean, median, mode, minimum, maximum, and select percentiles) for continuous
variables. Distributional assumptions also were explored for certain variables, as appropriate, in
preparation for more sophisticated models. For example, some environmental concentration data
may depart from normality, and follow a log-normal distribution. In these cases, we additionally
reported the geometric mean and geometric standard deviation as part of the simple descriptive
summary.
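For reference, the geometric mean and geometric standard deviation mentioned above can be computed by exponentiating the mean and standard deviation of the log-transformed values. The following is a minimal sketch under that standard definition; the function name is illustrative, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which form was used.

```python
import math
import statistics


def geometric_summary(values):
    """Return (geometric mean, geometric standard deviation).

    Assumes strictly positive values, as for concentration data.
    """
    if len(values) < 2 or any(v <= 0 for v in values):
        raise ValueError("need at least two strictly positive values")
    logs = [math.log(v) for v in values]
    gm = math.exp(statistics.mean(logs))
    # Sample standard deviation of the logs (n - 1 denominator); an assumption.
    gsd = math.exp(statistics.stdev(logs))
    return gm, gsd
```

For log-normally distributed data, these two quantities play the roles that the mean and standard deviation play for normally distributed data.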
The univariate descriptions then were followed by fitting a series of cross-sectional bivariate
relationships between the blood-lead response variable(s) and each candidate explanatory
variable. These cross-sectional relationships were explored as a function of time to better
understand the stability of these relationships, and whether they change over time, so that they
can be modeled appropriately in the more sophisticated longitudinal analyses. These analyses
also helped identify which explanatory variables are most predictive of the blood-lead
response variable.
In preparation for more sophisticated statistical analyses, such as the Generalized Linear Mixed
Logistic Regression Model outlined below, relevant stratified analyses were performed to
investigate interactions discussed in the data analysis plan. For example, the population density
variable was investigated in this manner, as density may serve as a surrogate to differentiate
between rural and urban geographic areas in the analyses - and exposure variables may be
different in these types of areas. Similarly, EPA regions were investigated as a potential
stratification variable. If the measure of effect (e.g., the odds ratio) does not vary
across the levels of a third variable, the third variable can likely be treated as a
potential confounder in the multivariate model, rather than as an effect modifier. If the odds
ratios differ markedly (e.g., the effect appears to be protective in one subgroup and hazardous in
another subgroup), the third variable must be considered as an effect modifier.
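The stratified comparison described above can be illustrated with stratum-specific odds ratios from 2x2 tables. This is a sketch only: the counts, the stratum names, and the helper functions are hypothetical, and the check for opposite directions is a simple screen rather than a formal test of homogeneity.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds ratio from a 2x2 table (cross-product ratio)."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)

def opposite_directions(ratios):
    """True if some stratum suggests a protective effect (OR < 1) and another a hazardous one (OR > 1)."""
    return min(ratios) < 1.0 < max(ratios)

# Hypothetical counts per stratum of the third variable:
# (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases)
strata = {
    "urban": (40, 160, 20, 180),
    "rural": (30, 170, 15, 185),
}
stratum_or = {name: odds_ratio(*cells) for name, cells in strata.items()}
for name, value in stratum_or.items():
    print(f"{name}: OR = {value:.2f}")
print("effect reverses across strata:", opposite_directions(list(stratum_or.values())))
```

In this made-up example the two odds ratios are close to one another, which is the pattern that would point toward treating the third variable as a potential confounder rather than an effect modifier.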
Specific variables within each type were explored using four general approaches - (1)
histograms or side-by-side box-plots of the candidate explanatory variable, (2) simple regression
line plots exploring the relationship between predicted risk of lead poisoning and the explanatory
variable for each of the four specified time periods, (3) distributional summaries of the
explanatory variable across the three time periods, and (4) statistical modeling of the relationship
between the explanatory variable and various blood-lead response variables after adjusting for
the effects of time and seasonality within different regions of the country for the National (Low-
Resolution) model and for the effects of time in the Massachusetts (High-Resolution) Model.
Histograms or Side-by-Side Box-Plots of Potential Explanatory Variables
Using one record for each quarterly county- or census-tract-level data point, a histogram
illustrating the distribution of the explanatory variable is presented. A fitted line assists with
assessing the distribution of each potential explanatory variable (e.g., whether the data are
approximately normally distributed). Histograms were plotted for potential predictor variables
that were time invariant. For predictor variables that varied over time within the analysis dataset,
side-by-side box-plots were used to characterize the distribution over the time periods, using an
average of the predictor variable across the quarters in which blood-lead concentrations were
observed within each time-period and area.
Logit Probability Plots for each Explanatory Variable
The county-level quarterly proportions of screened children exceeding 10 µg/dL reported by the
CDC were modeled as a function of each candidate explanatory variable, with separate logit
curves used to represent each of the time periods. This analysis allows comparison of the
relationship between the explanatory variable and predicted blood-lead level trends across time
periods. If the relationship is stable across time, roughly parallel curves are evident. If the effect
of the variable on blood-lead varies over time, non-parallel (and perhaps intersecting) curves are
observed. In this case, the longitudinal analyses may need to be adjusted to allow for the effect
of the covariate to change over time.
Plots of Predicted GM Blood-Lead Levels and Explanatory Variables
The census-tract-level quarterly blood-lead data available from Massachusetts were fit to each
explanatory variable to generate predicted GM blood-lead levels across the range of the
explanatory variable for each of the time periods. A simple linear regression line plot
summarizes this analysis with one line for each time period. This analysis allows comparison of
the relationship between the explanatory variable and predicted blood-lead level trends across
time periods. If the relationship is stable across time, roughly parallel lines are evident. If the
effect of the variable on blood lead varies over time, non-parallel (and perhaps intersecting) lines
are observed. In this case, separate slopes may need to be fit for these variables over different
periods of time in the more sophisticated longitudinal analyses.
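The per-period line fits described above can be sketched with ordinary least squares, fitting one line per time period and comparing slopes. The data, the period labels, and the tolerance used to call two slopes "roughly parallel" are all hypothetical choices made for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical tract-level values: explanatory variable (x) and GM blood lead (y) by period.
data = {
    "period 1": ([10, 30, 50, 70], [2.0, 3.0, 4.0, 5.0]),
    "period 2": ([10, 30, 50, 70], [1.5, 2.5, 3.5, 4.5]),
}
fits = {period: fit_line(x, y) for period, (x, y) in data.items()}
slopes = [slope for slope, _ in fits.values()]
# Illustrative tolerance only; the report judges parallelism visually from the plots.
roughly_parallel = max(slopes) - min(slopes) < 0.01
print(fits, roughly_parallel)
```

Here the two periods share a slope and differ only in intercept, which corresponds to the "roughly parallel lines" case where a single slope over time may suffice.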
Distributional Summaries
The first table presented for each explanatory variable contains a series of summary statistics for
each of the time periods including sample size, number missing, mean, and standard error. The
sample size is relative to the number of quarters represented in the analysis dataset; therefore,
these distributions correspond to the analysis dataset (and not necessarily to the distribution of
the variable across the nation or state). The distribution of the data for each time period also is
presented (minimum, median, and maximum and 10th, 25th, 75th, and 90th percentiles).
Comparing the summary data across time allows assessment of changes in the explanatory
variable over time for the groups of tracts included in the analysis for each time period.
Generally, the mix of counties and Massachusetts census tracts included in each of the time
periods is similar, so that the distribution of the data from each period also is similar.
Statistical Modeling of Relationship between Explanatory Variables and Exceedance of
Blood-Lead Thresholds for the National (Low-Resolution) Model:
For each candidate predictor variable being considered for the National (Low-Resolution)
Model, the following generalized linear mixed models approach was used to model the
proportion of children exceeding certain thresholds as a function of the predictor variable after
adjusting for Region-specific intercepts, slopes over time, and effects of seasonality:
\[
\operatorname{logit}\left(E[Y_{ij}/n_{ij}]\right) = \mathrm{Region}_{ik} \cdot \left(\beta_{0k} + \beta_{1k} \cdot t_{ij} + \beta_{2k} \cdot \mathrm{Season}_{ij}\right) + \beta_{3} X_{ij} + \delta_{0i} + \delta_{1i} \cdot t_{ij}
\]
Where (i indexes county, j indexes time, and k indexes the region of the country), $Y_{ij}$
represents the number of children observed above the blood-lead threshold in the ith
county at time j, $n_{ij}$ represents the number of children tested in the ith county at time j,
$t_{ij}$ and $\mathrm{Season}_{ij}$ are fixed effects variables corresponding to a time-trend (in years) and
seasonality, $X_{ij}$ is the candidate predictor variable being investigated, the beta
parameters ($\beta$) represent a vector of fixed effects, and the delta parameters ($\delta$)
represent random effects that allow each county to have its own trend over time. The
$X_{ij}$ predictor variable is mean centered in this series of models, allowing the intercept
term to be relatively stable across the multiple predictor variables being investigated.
In this model, it can be assumed that $\delta_{0i}$ and $\delta_{1i}$ jointly follow a multivariate normal
distribution with mean zero and covariance matrix
$\Sigma = \begin{pmatrix} \sigma_{00} & \sigma_{01} \\ \sigma_{10} & \sigma_{11} \end{pmatrix}$.
• Model 1 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 5 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (county/quarter).
• Model 2 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 10 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (county/quarter).
• Model 3 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 15 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (county/quarter).
• Model 4 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 25 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (county/quarter).
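The four responses above differ only in the threshold used to count exceedances. As a rough illustration, assuming individual test results were available for a single county/quarter record (the national data were supplied to the project team as summaries, so this is not how those counts were actually produced), the counts could be formed as follows. The function name and the example values are hypothetical.

```python
THRESHOLDS = (5, 10, 15, 25)  # µg/dL, one threshold per Model 1-4

def exceedance_counts(blood_lead_values, thresholds=THRESHOLDS):
    """Return (n, {threshold: Y}) where Y counts results at or above each threshold."""
    n = len(blood_lead_values)
    counts = {t: sum(1 for v in blood_lead_values if v >= t) for t in thresholds}
    return n, counts

# Hypothetical results (µg/dL) for one county/quarter record.
record = [2.1, 4.9, 5.0, 7.3, 10.0, 12.5, 26.0, 3.3]
n, y = exceedance_counts(record)
proportions = {t: count / n for t, count in y.items()}
print(n, y, proportions)
```

Each (Y, n) pair is the binomial response for the corresponding model; the observed proportion Y/n is what the logit link relates to the predictors.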
In addition to the above models, the project team explored whether the effect of each candidate
predictor variable on the exceedance proportions varied over time. This was done by exploring
the interaction between each candidate predictor variable and (1) a linear effect of time, (2) a
quadratic effect of time, and (3) a 4-level categorical effect of time.
Statistical Modeling of Relationship between Explanatory Variables and Exceedance of
Blood-Lead Thresholds for the Regional (High-Resolution) Model:
The Regional (High-Resolution) Models developed for the Massachusetts data at the census-tract
level of geographic specificity included models for both continuous data (geometric mean) and
binomial data (exceedance proportions). Therefore, each explanatory variable being considered
for these models was explored using models for both continuous and binomial data as described
below:
Continuous Data: The following mixed models analysis of variance (i.e., a random-effects
model for continuous data) was used to model the geometric mean (GM) blood-lead
concentration as a function of a candidate predictor variable:
Where (i indexes census tract, j indexes time), $GM_{ij}$ represents the geometric mean
blood-lead concentration in the ith census tract at time j, $t_{ij}$ is a fixed-effects variable
corresponding to a time-trend (in years), $X_{ij}$ is the candidate predictor variable being
investigated, the beta parameters ($\beta$) represent a vector of fixed effects, and the delta
parameters ($\delta$) represent random effects that allow each county or census tract to have
its own trend over time. The $X_{ij}$ variable typically is mean centered in this series of
models, allowing the intercept term to be relatively stable across the multiple
predictor variables being investigated. In this model, it can be assumed that $\delta_{0i}$ and
$\delta_{1i}$ jointly follow a multivariate normal distribution with mean zero and covariance
matrix $\Sigma = \begin{pmatrix} \sigma_{00} & \sigma_{01} \\ \sigma_{10} & \sigma_{11} \end{pmatrix}$,
and the residual error also is assumed to follow a normal
distribution with mean zero and variance $\sigma^2$.
• Model 1 follows the above approach - where the responses are weighted equally.
• Model 2 follows the above approach - where the responses (GM) are weighted by the
number of children observed (screened) within each record (census tract/quarter).
Binomial Data: The following generalized linear mixed model (i.e., a random-effects model for
binomial data) was used to model the proportion of children exceeding certain thresholds as a
function of a candidate predictor variable:
\[
\operatorname{logit}\left(E[Y_{ij}/n_{ij}]\right) = \beta_{0} + \beta_{1} \cdot t_{ij} + \beta_{2} \cdot \mathrm{Season}_{ij} + \beta_{3} X_{ij} + \delta_{0i} + \delta_{1i} \cdot t_{ij}
\]
Where (i indexes census tract and j indexes time), $Y_{ij}$ represents the number of
children observed above the blood-lead threshold in the ith census tract at time j, $n_{ij}$
represents the number of children tested in the ith census tract at time j, $t_{ij}$ is a fixed
effects variable corresponding to a time-trend (in years), $X_{ij}$ is the candidate predictor
variable being investigated, the beta parameters ($\beta$) represent a vector of fixed effects,
and the delta parameters ($\delta$) represent random effects that allow each census tract to
have its own trend over time. The $X_{ij}$ variable also is mean centered in this series of
models, allowing the intercept term to be relatively stable across the multiple
predictor variables being investigated. In this model, it can be assumed that $\delta_{0i}$ and
$\delta_{1i}$ jointly follow a multivariate normal distribution with mean zero and covariance
matrix $\Sigma = \begin{pmatrix} \sigma_{00} & \sigma_{01} \\ \sigma_{10} & \sigma_{11} \end{pmatrix}$.
• Model 3 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 5 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (census tract/quarter).
• Model 4 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 10 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (census tract/quarter).
• Model 5 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 15 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (census tract/quarter).
• Model 6 follows the above approach - where $Y_{ij}$ represents the number of children
observed with blood-lead concentrations at or above 25 µg/dL, and $n_{ij}$ represents the
total number of children screened within each record (census tract/quarter).
To allow comparison of the different variables explored within each variable type, tables in
Section 4 present the log-likelihood statistic from each model run presented
in Appendices A and B. Within each variable category, the variable that provided the best fit for
each of the six models is highlighted in yellow. To ensure compatibility in the likelihood-based
statistics being used to make comparisons among the different candidate predictor variables,
missing values for predictor variables were imputed using the mean of the distribution. The
number of imputed values that were necessary is provided by the nmiss column in the table of
distributional summaries described above. The project team chose whether to adjust for
changes in the slope for a candidate predictor variable over time based on a comparison of the
likelihood statistics after adjusting for the number of degrees of freedom used in the model for
the effects of the explanatory variable (over time) on the response. Those variables highlighted
in yellow have the largest likelihood statistic after adjusting for differences in the degrees of
freedom, and were considered as strong candidate predictors for the multivariate statistical
models.
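The text above does not state how the likelihood statistics were adjusted for degrees of freedom. One common way to make such an adjustment is the Akaike information criterion (AIC), used in the sketch below purely as a stand-in; the variable names and log-likelihood values are invented for illustration and are not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better. Used here as an assumed adjustment."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates within one variable group:
# name -> (model log-likelihood, degrees of freedom used for the candidate's effect)
candidates = {
    "pct_pre1950_housing": (-1520.4, 1),
    "pct_pre1950_housing_by_time": (-1519.9, 2),
    "median_income": (-1534.0, 1),
}
scores = {name: aic(ll, df) for name, (ll, df) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this example the time-interaction version has a slightly higher log-likelihood but uses an extra degree of freedom, so the penalized comparison favors the simpler form. Comparisons of this kind are only meaningful when every model is fit to the same records, which is the reason given above for imputing missing predictor values.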
Due to the iterative nature and complexity of the Mixed Models Analysis of Variance and
Generalized Linear Mixed Modeling Approaches, these models did not always converge.
Models that failed to converge for a particular predictor variable are discussed in the results
sections within Appendices D and E, and also are indicated in Tables 4-1 and 4-2, as well as in
the summary pages of Appendices A and B by blank cells. Cases in which model convergence is
not attained likely will translate to exclusion of that particular variable when building the
multivariate model. Note that because of the sparseness of data for children with blood-lead
levels at or above 25 µg/dL within the Massachusetts data, Model 6 failed to converge across all
variables. Thus, Model 6 results are not presented or discussed for the Massachusetts models.
2.4 Development of Multivariate Statistical Models
2.4.1 Statistical Models for the Broad Coverage - Low-Resolution Model
This model is being used to characterize broad trends over time in the prevalence of childhood
lead poisoning across the entire United States. The various surveillance, environmental sources,
demographic characteristics, and programmatic support data sources were aggregated to the
county level for all localities with universal screening and reporting. Quarterly estimates of each
candidate predictor variable were created for each county within the United States, including
those county/quarter combinations that did not include observed blood-lead response variable
information (allowing for the extrapolation of the model predictions to geographic areas and
time-points that were not represented within CDC's National Surveillance Database).
In addition to investigating the predictive ability of each potential environmental, programmatic
and demographic variable as described earlier, various stratification variables (region of
the country, population density) and covariates (time trend and seasonality) were investigated.
As a result, all four multivariate statistical models adjust for a categorical variable that
differentiates among the risk of childhood lead poisoning within the 10 EPA regions. Within
each EPA region, a separate intercept, trend over time, and seasonality term (based on fitting
intercepts for each quarter of time) was included in the multivariate statistical model. For the
purposes of discussion, it was assumed that the modeling approach will focus on a logistic
regression model for the proportion of children that have elevated blood-lead concentrations
(>10 µg/dL). The temporal nature of declining childhood lead poisoning will be addressed via
classic concepts of longitudinal data modeling of the low resolution data. Let
$Y_{ij}$ represent the number of children that were detected with blood-lead concentration above
10 µg/dL from the ith county and jth point in time (quarter),
$n_{ij}$ represent the number of children that had their blood-lead concentration tested from within
the ith county and jth point in time (quarter),
Please note that we expect that $n_{ij} \le N_{ij}$, where $N_{ij}$ represents the total population of children
in the ith county and jth point in time.
$t_{ij}$ represent time (in years) corresponding to the $Y_{ij}$ response variable,
$\mathrm{Region}_{ik}$ represent the region of the country that the ith county is located within (where
k = 1, ..., 10 and is representative of the 10 established EPA regions),
$\mathrm{Season}_{ij}$ represent a series of 4 indicator variables that differentiate between the 4 different
quarters captured by the j-index, and
$X_{ij}$ represent a series of predictor variables associated with the $Y_{ij}$ response variable. These
predictor variables may represent air monitoring data, drinking water data, census
demographic data, programmatic data on federal financial support for lead poisoning
prevention, and other related information as detailed above that can potentially help predict
the prevalence of lead poisoning at the county level.
We introduce the following as a potential baseline model:
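\[
\operatorname{logit}\left(E[Y_{ij}/n_{ij}]\right) = \mathrm{Region}_{ik} \cdot \left(\beta_{0k} + \beta_{1k} \cdot t_{ij} + \beta_{2k} \cdot \mathrm{Season}_{ij}\right) + \beta_{3} X_{ij} + \delta_{0i} + \delta_{1i} \cdot t_{ij}
\]
Where the beta parameters ($\beta$) represent a vector of fixed effects, and the delta parameters ($\delta$)
represent random effects that allow each county to have its own trend over time. In this
model, it can be assumed that $\delta_{0i}$ and $\delta_{1i}$ jointly follow a multivariate normal distribution with
mean zero and covariance matrix
$\Sigma = \begin{pmatrix} \sigma_{00} & \sigma_{01} \\ \sigma_{10} & \sigma_{11} \end{pmatrix}$.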
Counties with larger $\delta_{1i}$ parameter estimates represent areas where lead-poisoning has not
significantly decreased over time. Similarly, the parameter estimates can be used to identify
those counties with the highest predicted prevalence of childhood lead poisoning at various time
points in the future.
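To make the role of the county-level random effects concrete, the sketch below evaluates the linear predictor of a model of this form for one county and converts it to a predicted proportion with the inverse logit. All parameter values are hypothetical, and the seasonal term is passed in as the single effect that applies to the quarter in question, which is a simplification of the indicator-variable coding.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor on the logit scale to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_proportion(beta0_k, beta1_k, season_effect_k, beta3,
                         t, x_centered, delta0_i, delta1_i):
    """Predicted exceedance proportion for county i in region k at time t (in years)."""
    eta = (beta0_k + beta1_k * t + season_effect_k   # region-specific fixed effects
           + beta3 * x_centered                      # mean-centered predictor
           + delta0_i + delta1_i * t)                # county-specific intercept and slope
    return inverse_logit(eta)

# Hypothetical parameter values, for illustration only.
common = dict(beta0_k=-2.0, beta1_k=-0.15, season_effect_k=0.1, beta3=0.02,
              t=5.0, x_centered=10.0, delta0_i=0.0)
typical_county = predicted_proportion(delta1_i=0.0, **common)
slow_decline_county = predicted_proportion(delta1_i=0.10, **common)
print(typical_county, slow_decline_county)
```

With a negative regional time trend, a county whose random slope is larger than zero declines more slowly, so its predicted proportion at a later time point is higher than that of an otherwise identical county.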
In building the multivariate statistical model for the Broad-Based Modeling Objective, the
project team first evaluated the predictive ability of each candidate predictor variable that was
considered within the exploratory analyses. For the environmental predictor variables, in
particular (information from EPA's 1999 National Air Toxics Assessment, Safe Drinking Water
Information System, and Toxics Release Inventory), the data were largely concentrated at zero.
Therefore, a series of zero/one indicator variables that represent county/quarter combinations at
or above the 95th and 99th percentile of observed values of these environmental predictor
variables within the analysis dataset also were investigated.
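A minimal sketch of such zero/one indicators follows. The percentile definition (nearest rank) and the example values are assumptions made for illustration; the report does not say which percentile method was used.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]

def high_value_indicator(values, pct):
    """Return a 0/1 flag per record: 1 if the value is at or above the pct-th percentile."""
    cutoff = nearest_rank_percentile(values, pct)
    return [1 if v >= cutoff else 0 for v in values]

# Hypothetical county/quarter values of an environmental predictor, mostly zero.
emissions = [0.0] * 18 + [0.4, 2.5]
flag_95 = high_value_indicator(emissions, 95)
flag_99 = high_value_indicator(emissions, 99)
print(sum(flag_95), sum(flag_99))
```

One caveat of this simple version: if the chosen percentile is itself zero, every record is flagged, so a real implementation would need a rule for that case.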
Once the predictive ability of each candidate variable was established within the exploratory
analyses described earlier, candidate predictor variables were classified into groups (e.g.,
housing age, income, education, air modeling, programmatic financial support) and then the
single best predictor variable within each group was selected for possible inclusion within each
of the six multivariate statistical models being developed. If the selected variable demonstrated a
relationship with risk of lead poisoning that changes over time (as evidenced by intersecting lines
in the plots generated in the exploratory analyses), then this interaction was taken into
consideration within the evaluation of the predictive ability of the candidate variable(s).
The approach to determining which environmental, programmatic, and demographic variables
were included in the model followed a backward elimination process - in which each variable
group's best predictor variable identified earlier was included in the first model - with variables
being eliminated from the model when they were not deemed to be highly significant. This
model building process also was aided by investigation of the selected environmental,
programmatic, and demographic variables for issues of potential collinearity via investigation of
correlation matrices and principal components analysis. The resulting multivariate statistical
models were parsimonious - and in most cases only included variables that were highly
statistically significant. In a few cases, a variable was left in the model without being highly
significant - because its elimination caused a large drop in the model log-likelihood (suggesting
that the model is significantly improved with the addition of a variable whose slope is not
significantly different from zero).
After the multivariate statistical models were developed, model fit diagnostics were evaluated
and documented.
The parameter estimates for the four National Multivariate Statistical Models are provided in the
results section. The results of these models also were explored in multiple ways. Maps were
generated to demonstrate observed and predicted proportion of children at or above 10 µg/dL
within each EPA region for the Years 2000 and 2005 (data were appropriately averaged across
the four quarters in each of these years prior to mapping). Lists also were generated to identify
the 150 highest risk counties across the United States at the end of the observation period (2006)
as predicted by each of the six models, as well as the 10 highest risk counties within each state.
Finally, the predicted values from these multivariate statistical models (extrapolated to
county/quarter combinations not represented in the CDC surveillance database) were integrated
in a unique data visualization tool. The product of this effort is a time-series of maps (or a
movie) that spatially interpolates risk of childhood lead poisoning as a function of various
appropriate predictor variables. The visualization tool allows users to interact with the modeling
results at different levels of temporal and geographic specificity. The tool allows the user to
select an appropriate response variable (e.g., proportion of children with blood-lead
concentrations above 5 µg/dL) and play a movie showing a time-series of maps that displays
how the predicted (or observed) risk changes over time across the various counties within a
selected state. The user can zoom in on a rectangular area, to see these results with a higher
degree of geographic specificity. The user also can stop the movie (or rewind, or fast-forward)
to isolate specific points in time. By using the mouse, the user also can select a specific county
and the tool will display the observed and predicted data for that particular county in a separate
window. The visualization tool was written in C++, and was built in a manner that will allow
EPA to modify the model and for the project team to quickly import the resulting data from a
modification into the tool.
2.4.2 Statistical Models for the High-Resolution Model within Massachusetts
High-resolution models will be utilized to identify the relative contribution of various types of
exposure sources in elevated risk for childhood lead poisoning within select communities within
the Commonwealth of Massachusetts. These types of sources include housing factors, broader
environmental exposure, demographic composition, and programmatic resources. While this
type of model reflects the idea that exposures contributing to childhood lead poisoning likely are
community-specific, analysis of the high-resolution models may have certain limitations
including selection bias and generalizability to other geographic areas.
The Massachusetts Department of Public Health (MDPH) entered into a limited use data sharing
agreement with the project team, allowing them to provide blood-lead testing results on
individual children (aged 6-36 months) and housing inspection data in a format that preserves
linkages through a housing unit identification variable. These data will be utilized in two
different modeling approaches. The first modeling approach will seek to develop census tract
quarterly summary measures similar to the National Model for blood-lead (e.g., exceedance
proportions and geometric means), as well as summary measures for the proportion of housing
units in each census tract that are known to be in (or out of) compliance with the Massachusetts
standard of care (for use as a potential explanatory variable). MDPH also has provided the
project team with summary information regarding HUD and state funding of residential housing
interventions (lead hazard control and abatement) - which will be used to develop a longitudinal
summary of current and cumulative per-capita spending on residential intervention within each
census tract (using various assumptions on the allocation of such dollars). Other explanatory
variables, such as the U.S. Census, EPA Toxics Release Inventory, 1999 National Air Toxics
Assessment, and water quality data will be available for use in these models.
These census-tract level summary data (both response variable and explanatory variables) were
modeled using a similar approach to what is being proposed for the National (Low-Resolution)
Model - only the unit of clustering was census tract rather than county.
3.0 DATA SOURCES AND DATABASE DEVELOPMENT
The main goal of the statistical analysis was to develop a series of predictive models that help
provide a better understanding of (1) the relative importance of various exposure sources in
addition to leaded paint in housing and (2) the geographic areas across the United States that
remain at increased risk for childhood lead poisoning. To do so, blood-lead data were combined
with various environmental, demographic, and programmatic datasets at different levels of
geographic specificity and coverage. A description of each of these data sources, as well as
discussion of how they were combined, is included in this section.
3.1 Children's Blood-Lead Measurements
The statistical models are based upon blood-lead levels of children corresponding to the various
geographic areas studied. To enable national analyses, CDC's Lead Poisoning Prevention
Branch provided quarterly summary data from their national surveillance database for children
aged 6-36 months within each county that had submitted data. These summary measures
included the number of children screened, percentage of children who exceeded certain blood-
lead thresholds, and arithmetic mean blood-lead concentration for state/local grantees with a
history of universal reporting.
The intention was to have the models reflect the annual prevalence of childhood lead poisoning
over time. Thus, the data were summarized so that each child could only be reported once a
year. An algorithm was developed to select representative screening test(s) for children with
multiple results, with the objective of having each child represented in the analysis dataset
at most once a year. For a given patient with multiple testing results, the algorithm
preferentially selected tests confirming elevated blood-lead levels and then selected follow-up
tests taken beyond nine months of the previously selected test. Screening tests were selected
when no confirming record was available.
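The selection rule is described only in general terms, so the following is one possible reading rather than the algorithm actually used. It assumes that each test carries a date and a flag marking it as confirmatory, that "nine months" can be approximated as 274 days, and that the earliest confirmatory test (or, failing that, the earliest screening test) anchors the selection. All names and records are hypothetical.

```python
from datetime import date, timedelta

MIN_GAP = timedelta(days=274)  # assumption: "nine months" approximated as 274 days

def select_tests(tests):
    """tests: list of (test_date, is_confirmatory, result) tuples for one child.

    Returns the tests retained for analysis under the assumed reading of the rule.
    """
    ordered = sorted(tests, key=lambda t: t[0])
    confirmatory = [t for t in ordered if t[1]]
    # Prefer a confirmatory test as the anchor; fall back to the earliest screening test.
    selected = [confirmatory[0] if confirmatory else ordered[0]]
    for test in ordered:
        # Keep a later test only if it falls more than nine months after the last one kept.
        if test[0] - selected[-1][0] > MIN_GAP:
            selected.append(test)
    return selected

# Hypothetical testing history for one child.
history = [
    (date(2001, 1, 10), False, 12.0),  # initial screening test
    (date(2001, 2, 1), True, 11.0),    # confirmatory test
    (date(2001, 6, 1), False, 9.0),    # follow-up within nine months (dropped)
    (date(2002, 1, 15), False, 6.0),   # follow-up beyond nine months (kept)
]
chosen = select_tests(history)
print(chosen)
```

Note that a nine-month gap does not by itself guarantee a single record per calendar year, so a stricter once-per-year constraint, if required, would need an additional check.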
The response variable consists of quarterly summary statistics from 1995-2005 on the
distribution of observed blood-lead concentrations in counties across the nation, based on
information from CDC's national surveillance database. It should be noted that there likely
exists significant variation and differences in the sampling and analytical methodologies
employed in performing childhood blood-lead testing among the different counties that
contributed to the CDC dataset, and within counties over time. Sampling methods include both
capillary and venous tests, and different laboratory methods likely are represented within the data
with varying reporting limits or limits of detection. Variation in reporting limits and limits of
detection could introduce significant biases into statistical models of any continuous measure of
blood-lead concentration, such as the geometric or
arithmetic mean blood-lead concentration. In contrast, there was agreement among the
research team and the CDC that measures of whether a testing result was found above or below
certain threshold values (5, 10, 15, and 25 µg/dL) would be more robust to these potential
reporting and detection biases. Therefore, the National (Low-Resolution) Models focus on the
proportion of screened children found above these threshold values using a logistic regression
modeling approach.
After summarizing the test-level data by year, quarter, and county, counties that contained fewer
than five test records in a quarter were excluded for confidentiality reasons. The time series of
summary statistics within select counties were initially investigated to determine appropriate
exclusion criteria to ensure that the data retained for analysis represented blood-lead
concentrations that were universally reported (i.e., there were periods of time in which some
state or local childhood lead poisoning prevention programs only reported elevated blood-lead
concentrations - and these data needed to be eliminated from the analysis). Thus, the number of
quarterly summary statistics varied from county to county within the analysis dataset.
As a prelude to developing the screening algorithm for elimination of data from counties that
were not following universal reporting protocols, a subset of data from counties with obvious
non-universal reporting was identified from within the National quarterly aggregate summary
database. The algorithm was developed based on application to this subset of data prior to being
utilized on the remainder of the National Surveillance database. The algorithm is based on the
following:
Let
$n_{ij}$ represent the number of children observed in the ith county during the jth quarter
$P90(n_i)$ represent the 90th percentile of observed $n_{ij}$ within the ith county
$AM_{ij}$ represent the arithmetic mean blood-lead concentration observed in the ith county
during the jth quarter
$P50(AM_i)$ represent the 50th percentile of observed $AM_{ij}$ within the ith county
$P10_{ij}$ represent the proportion of children with blood-lead concentration observed at or
above 10 µg/dL in the ith county during the jth quarter.
Then the following 3 exclusion/inclusion criteria are applied sequentially:
Criterion #1: If $n_{ij} < \max(P90(n_i)/5, 15)$ and ($AM_{ij} > 2 \cdot P50(AM_i)$ or $P10_{ij} > 0.75$), then
exclude the data from the ith county during the jth quarter. This exclusion criterion
essentially eliminates county/quarter combinations with relatively lower screening
penetration (compared to when peak screening was achieved) and high blood-lead
concentrations. The rationale for this exclusion criterion is that the periods of time in
which a lead poisoning prevention program is not conducting universal reporting will
involve fewer reported testing results that have higher blood-lead concentrations.
Criterion #2: If $n_{ij} > 100$ and $AM_{ij} < 7$, then include the data from the ith county during the
jth quarter. This criterion was added to include a small number of county/quarter
combinations within the testing subset of data that were eliminated by the first exclusion
criterion but did not appear to be inconsistent with the remainder of data that would be
included in the analyses. This second criterion was inspected carefully upon application to
the entire set of quarterly county summary statistics from CDC's National Surveillance
database, to ensure that it was reintroducing data into the analysis in a manner consistent
with the data analysis goals.
Criterion #3: If $n_{ij} < 100$ and $AM_{ij} > 10$, then exclude the data from the ith county during
the jth quarter. This third criterion was established to exclude a small amount of data that
was not captured by the first exclusion criterion (mostly representing counties with a
median observed blood-lead concentration slightly above 5 µg/dL).
Within the quarterly county summary statistics from CDC's National Surveillance database,
there were 72,466 county/quarter combination-level records. Application of Criteria 1-3 above
eliminated a total of 2,351 records (3.25%) from the final analysis dataset.
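The sequential application of the three criteria to one county's quarterly records can be sketched as below. The record layout, the nearest-rank definition of the 90th percentile, and the example values are assumptions made for illustration; only the thresholds come from the criteria as stated.

```python
from statistics import median

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def screen_county(records):
    """records: list of dicts with keys 'n', 'am', 'p10' (one per quarter for one county).

    Returns a list of booleans: True if the quarter is retained for analysis.
    """
    p90_n = nearest_rank_percentile([r["n"] for r in records], 90)
    p50_am = median(r["am"] for r in records)
    keep = []
    for r in records:
        retained = True
        # Criterion 1: low screening volume together with high blood-lead summaries.
        if r["n"] < max(p90_n / 5, 15) and (r["am"] > 2 * p50_am or r["p10"] > 0.75):
            retained = False
        # Criterion 2: re-include quarters with many tests and a low arithmetic mean.
        if r["n"] > 100 and r["am"] < 7:
            retained = True
        # Criterion 3: exclude remaining small quarters with a high arithmetic mean.
        if r["n"] < 100 and r["am"] > 10:
            retained = False
        keep.append(retained)
    return keep

# Hypothetical quarterly summaries for one county.
county = [
    {"n": 600, "am": 3.0, "p10": 0.04},
    {"n": 580, "am": 3.2, "p10": 0.05},
    {"n": 620, "am": 2.9, "p10": 0.03},
    {"n": 640, "am": 3.1, "p10": 0.04},
    {"n": 12, "am": 14.0, "p10": 0.80},   # excluded by Criterion 1 (and 3)
    {"n": 110, "am": 6.5, "p10": 0.30},   # excluded by Criterion 1, restored by Criterion 2
]
keep = screen_county(county)
print(keep)
```

Because the criteria are applied in order, a later criterion can override an earlier decision, which is how Criterion 2 reintroduces records that Criterion 1 removed.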
To enable analyses at a finer level of geographic detail than the county level, the MDPH
provided blood-lead surveillance data on specific testing results for individual children (with
confidential identification information excluded) so that data could be summarized and reported
by census tract. The Massachusetts blood-lead surveillance data represent all children aged
6-36 months tested during the period 2000-2006. As with the national data, quarterly
census-tract-level records were created for analysis.
Due to selection bias, it is expected that the CDC National Surveillance dataset as well as the
Massachusetts surveillance data may show higher proportions of elevated blood-lead
concentrations than found in the general population. For this reason, the proportion of children
with elevated blood-lead concentrations as well as the distribution of the potential continuous
summary measure derived from the surveillance data were compared with those reported by the
most recent six years of available CDC National Health and Nutrition Examination Survey
(NHANES). Results of this comparison are presented in Section 7.2. In the future, to account
for differences between the surveillance and NHANES data, modifications could be made to the
models to calibrate the surveillance data to better match the national distribution of childhood
blood-lead concentrations as appropriate (Strauss, 2001a).
3.2 Demographic Data
Demographic information from the 2000 U.S. Census was utilized in both the high- and low-
resolution models, with data being acquired at the county level for the entire nation and at the
census-tract level for Massachusetts. The Census 2000 data gathered by the Census Bureau
includes over 1,000 variables. To narrow the scope of the project, 43 variables within 9 general
categories were selected and explored, most of which had been used previously by the project
team in a CDC-sponsored study to predict risk of elevated blood-lead concentrations at the
census tract level (Strauss, 2001b). In many cases, the census variables are constructed from
counts or summary statistics published in the detailed Census 2000 tables. For example, within
each geographic area, the Census Bureau reported the number of houses that were built before
1950 and the median income of all households. In order for the analysis to draw comparisons
from tract to tract and/or county to county, however, the Census variables needed to be
manipulated in a fashion that depended upon the format of the variable. For example, count
variables, such as the number of housing units built before 1950, were changed to percentages.
Summary statistic variables describing income, on the other hand, may be standardized within
state to adjust for between-state differences in the cost of living. Table 3-1 supplies the list of
the variables investigated within the nine categories and notes how they were calculated.
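The two kinds of transformation described above can be sketched as follows. The report does not state how the within-state standardization was computed, so the z-score used here is an assumption, and all function and field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def pct(count, denominator):
    """Convert a count to a percentage of its denominator (None if undefined)."""
    return 100.0 * count / denominator if denominator else None


def standardize_within_state(records, field):
    """Add a within-state z-score for `field` to each record.

    Assumption: "standardized within state" is read as subtracting the
    state mean and dividing by the state standard deviation.
    """
    by_state = defaultdict(list)
    for r in records:
        by_state[r["state"]].append(r[field])
    out = []
    for r in records:
        values = by_state[r["state"]]
        mu, sd = mean(values), pstdev(values)
        z = (r[field] - mu) / sd if sd > 0 else 0.0
        out.append({**r, field + "_std": z})
    return out
```

For example, `pct(250, 1000)` turns 250 pre-1950 housing units out of 1,000 into 25.0 percent, which can then be compared across tracts of different sizes.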
Table 3-1. Initial Variables for Analysis Created From the 2000 Census

| Variable Group | Census Variable* | Format | Calculation (denominator) | Analyzed Variable |
|---|---|---|---|---|
| Density | Persons | Count | Land Area (units = 0.001 km²) | Population Density |
| Density | Housing units | Count | Land Area (units = 0.001 km²) | Housing Density |
| Race | White population | Count | Persons | Pct White |
| Race | Black population | Count | Persons | Pct Black |
| Race | Indian, Eskimo, and Aleut population | Count | Persons | Pct American Indian and Alaskan Native |
| Race | Asian Pacific population | Count | Persons | Pct Asian |
| Race | Other Race population | Count | Persons | Pct Other Race |
| Race | Native Hawaiian and Other Pacific Islander population | Count | Persons | Pct Native Hawaiian and Other Pacific Islander |
| Race | Multiple Race population | Count | Persons | Pct Multiple Race |
| Race | Hispanic population | Count | Persons | Pct Hispanic |
| Age | Children Less than or Equal to 6 Years Old | Count | Persons | Pct LE 6 years |
| Age | Median Age* | Statistic | | Median age of persons |
| Age | Median Age of Children Less than or Equal to 6 Years Old* | Statistic | | Median age of persons LE 6 years |
| Family Structure | Single Parent* = Single Male with Children + Single Female with Children | Count | Household with Children Less than or equal to 18 years old = Married Couple with children + Single Male with Children + Single Female with Children | Pct Single Parent |
| Education | Less than a 9th grade Education | Count | Persons 18 years old and over | Pct less than 9th grade |
| Education | Less than high school* = #13 + persons with 9th to 12th grade education without obtaining a high school diploma | Count | Persons 18 years old and over | Pct no HS degree |
| Education | Less than college* = #14 + persons with high school diploma, but no college experience | Count | Persons 18 years old and over | Pct no college |
| Education | Less than college degree* = #15 + persons that attended college without obtaining a college diploma | Count | Persons 18 years old and over | Pct no college degree |
| Income | Household Median Income | Statistic | | Standardized Median Income for Households |
| Income | Family Median Income | Statistic | | Standardized Median Income of Families |
| Income | Per Capita Income | Statistic | | Standardized per capita income of persons |
| Income | Households without earnings | Count | Households | Pct No Earnings |
| Income | Households without wages | Count | Households | Pct No Wage or Salary |
| Income | Households that obtain public assistance | Count | Households | Pct With Public Assistance |
| Poverty Level | Persons below poverty level | Count | Persons for whom poverty status is determined | Pct Persons Below Poverty |
| Poverty Level | Persons who are less than or equal to five years old that are below poverty level* | Count | Persons who are less than or equal to five years old for whom poverty status is determined | Pct Persons of Age LE 5 Below Poverty |
| Poverty Level | Families with total income below the poverty level | Count | Families | Pct Families Below Poverty |
| Poverty Level | Families with total income below the poverty level that have children under 5 years old | Count | Families with children under five years old | Pct Poverty of Families with Children LT 5 |
| Housing Units | Vacant | Count | Housing Units | Pct Vacant |
| Housing Units | Housing Units Built before 1940 | Count | Housing Units | Pct Pre 1940 Housing |
| Housing Units | Housing Units Built before 1950 | Count | Housing Units | Pct Pre 1950 Housing |
| Housing Units | Housing Units Built before 1960 | Count | Housing Units | Pct Pre 1960 Housing |
| Housing Units | Housing Units Built before 1970 | Count | Housing Units | Pct Pre 1970 Housing |
| Housing Units | Housing Units Built before 1980 | Count | Housing Units | Pct Pre 1980 Housing |
| Housing Units | Median Year that Housing Units were Built | Statistic | | Median Year Built |
| Housing Units | Median Year that Housing Units were Built - Calculated by the Project Team | Statistic | | Calculated Median Year Built |
| Occupied Housing Units | Housing Units that are rented | Count | Occupied Housing Units | Pct Renter Occupied |
| Occupied Housing Units | Occupied Housing Units Built before 1940 | Count | Occupied Housing Units | Pct Pre 1940 Occupied Housing |
| Occupied Housing Units | Occupied Housing Units Built before 1950 | Count | Occupied Housing Units | Pct Pre 1950 Occupied Housing |
| Occupied Housing Units | Occupied Housing Units Built before 1960 | Count | Occupied Housing Units | Pct Pre 1960 Occupied Housing |
| Occupied Housing Units | Occupied Housing Units Built before 1970 | Count | Occupied Housing Units | Pct Pre 1970 Occupied Housing |
| Occupied Housing Units | Occupied Housing Units Built before 1980 | Count | Occupied Housing Units | Pct Pre 1980 Occupied Housing |
| Occupied Housing Units | Median Year that Occupied Housing Units were Built | Statistic | | Median Year Built - Occupied Only |
| Housing Value | Median Rent | Statistic | | Standardized Median Gross Rent |
| Housing Value | Value of Owner Occupied Housing Units | Statistic | | Standardized Median Housing Unit Value |

* Variables that were created by combining different pieces of information from the 2000 Census
Income and Poverty
Median income per household, family, and person were calculated. Additionally, the proportion
of households that do not receive any wages, do not receive any earnings, and do receive public
assistance were investigated. The census defines earnings and wages as follows:
• "Earnings" represent the amount of income received regularly before deductions for
personal income taxes, Social Security, bond purchases, union dues, Medicare
deductions, etc.
• "Wages" include total money earnings received for work performed as an employee
during the calendar year 1999. It includes wages, salary, Armed Forces pay,
commissions, tips, piece-rate payments, and cash bonuses earned before deductions were
made for taxes, bonds, pensions, union dues, etc.
Similar to the income variables described above, the poverty levels of individuals and families
within each county were summarized as the variables Percent Persons and Percent Families
Below the Poverty Level. In order to focus on the poverty level of the children within each
county, however, Percent Persons Five Years and Under and Percent Families with Children
Under Five Years Below Poverty Level variables were created. Note that in calculating the
various percentages for each of the variables, the denominator changes. Also note that for some
of the multivariate models presented later in the report, some of the income variables may have
been rescaled to represent income in thousands of dollars, to allow the parameter estimates for
the regression models to be discernible within the first 3 significant digits.
Race
The Census Bureau presents five general race groups: (1) White, (2) Black, (3) Indian, Eskimo,
and Aleut, (4) Asian Pacific, and (5) Other, each of which was included and explored separately.
Additional variables were included on percent of Native Hawaiians and Other Pacific Islanders,
percent of the population reporting multiple races (Percent Multiple Races), and percent of the
population reporting that they are Hispanic (Percent Hispanic).
Housing Cost
Two variables were constructed to investigate housing cost - Median Rent and Median Housing
Value. Median Housing Value includes the value of all housing units (owned and rented). Both
of these variables were standardized to account for state-to-state differences in the cost of living.
Note that for some of the multivariate models presented later in the report, some of the housing
cost variables may have been rescaled to represent housing costs in thousands of dollars, to allow
the parameter estimates for the regression models to be discernible within the first 3 significant
digits.
Occupancy
Occupied housing units are more likely to have lead paint removed than vacant homes. Thus, the
percent of housing units that are vacant potentially indicates the level of care taken to maintain
buildings within the area. Buildings that are not occupied are more likely to accumulate dust or
debris to which the children of an area may be exposed upon reoccupancy. Percent of vacant
housing units was explored for those reasons. Similarly, the standard of care could be different
between rental properties and owner-occupied properties. Thus, the percent of rental units in an
area also was explored. The percent of occupied housing units that are rented, rather than
owned, was calculated by dividing the number of rented occupied housing units within an area
by the total number of occupied housing units.
Family Structure
The Census Bureau does not supply a unique variable that indicates the number of single parent
households within an area. Therefore, this variable was created by combining Census variables
as follows:
M = Number of Households with a male householder (no wife present) whose own
children are under 18 years old
F = Number of Households with a female householder (no husband present) whose own
children are under 18 years old
T = M + F + Number of married couples with own children under 18 years.
The Percent of Single Parent Households variable used represented (M+F)/T.
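As a small illustration of the calculation above, the sketch below computes the variable from the three household counts. The function and argument names are hypothetical, and the result is expressed as a percentage.

```python
def pct_single_parent(m, f, married):
    """Percent of households with own children under 18 headed by a single parent.

    m       : male householder, no wife present, own children under 18
    f       : female householder, no husband present, own children under 18
    married : married couples with own children under 18
    """
    total = m + f + married  # T in the text
    return 100.0 * (m + f) / total if total else None
```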
Housing Age
During the 1950s, as the United States started to become aware of the consequences associated
with exposure to lead in paint, the use of lead paint within homes began to decrease. In
1977, the use of lead paint in homes became illegal. Thus, the years during which the
housing units within each area were built are important to characterize; older homes are more
likely to contain lead paint than newer homes. A number of variables related to housing age by
county were investigated to identify those that best predict children's blood-lead levels. Census
data on the full population of housing units as well as the population of occupied housing units
were investigated. Note that for some of the multivariate models presented later in the report, the
median age of house variable was centered at 1950 to provide stability to the intercept term in
the models.
Children's Age
The Census Bureau does not report all data by single years of age. More typically the agency
reports the total number of people that fall into various age categories. The variable "Pct LE 6
years" was created to capture the percentage of persons within each geographic area who were less
than or equal to six years of age at the time of the 2000 Census. Additionally, the median age of the
total population and of those less than or equal to six years old was calculated by taking a
weighted average of the midpoint of each age category (the counts are used as the weights).
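The weighted-midpoint calculation described above can be sketched as follows. The age-category boundaries in any input are hypothetical; the report does not list the Census age categories that were used.

```python
def weighted_midpoint_age(age_bins):
    """Weighted average of age-category midpoints, with counts as the weights.

    age_bins: iterable of (lower_age, upper_age, count) tuples.
    """
    bins = list(age_bins)
    total = sum(count for _, _, count in bins)
    if total == 0:
        return None
    weighted_sum = sum((lo + hi) / 2 * count for lo, hi, count in bins)
    return weighted_sum / total
```

For two hypothetical categories, ages 0-2 with 10 children and ages 2-4 with 30 children, the midpoints are 1 and 3 and the result is (1 × 10 + 3 × 30) / 40 = 2.5 years.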
Education
A series of variables pertaining to the proportion of adults with various levels of education were
created as follows:
L9 = Number of people older than 18, that have less than a 9th grade education
L12 = Number of people older than 18, that have 9th through 12th grade experience, but
do not have a high school diploma
12 = Number of people older than 18, that obtained a high school diploma or GED
C = Number of people older than 18, that have some college experience but did not
receive a college degree
T = Number of people that are older than 18 years
Percentage variables were created from the L9 through C variables by dividing them by the total
number of people over 18 years old. Exploratory analyses were conducted upon the four
percentage variables.
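A minimal sketch of the four education percentages follows. It assumes the cumulative definitions given in Table 3-1 (each category adds the one before it), which is one reading of the text; the function name, argument names, and dictionary keys are hypothetical.

```python
from itertools import accumulate


def education_percentages(l9, l12, hs, c, total_adults):
    """Cumulative education percentages among adults.

    l9, l12, hs, c correspond to L9, L12, 12, and C in the text;
    total_adults corresponds to T.
    """
    if total_adults <= 0:
        raise ValueError("total_adults must be positive")
    labels = [
        "pct_lt_9th_grade",       # L9 / T
        "pct_no_hs_degree",       # (L9 + L12) / T
        "pct_no_college",         # (L9 + L12 + 12) / T
        "pct_no_college_degree",  # (L9 + L12 + 12 + C) / T
    ]
    cumulative = accumulate([l9, l12, hs, c])
    return {label: 100.0 * n / total_adults for label, n in zip(labels, cumulative)}
```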
Population Variables
Because both counties and census tracts vary with respect to spatial area and population, and
previous work suggests that risk of childhood lead poisoning differs between rural and urban
areas, a population density variable was used as a potential explanatory variable or effect
modifier in the statistical models. Population density was explored in two ways. The first
divides the number of people within the tract by the amount of land area measured in .001 square
kilometers. The second divides the number of housing units by the amount of land area
measured in .001 square kilometers. Housing units include the following: a house, an apartment,
a mobile home, a group of rooms, or single room that is occupied as separate living quarters.
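Both density variables follow the same calculation, sketched below. The sketch assumes that "land area measured in .001 square kilometers" means the land area is expressed as a number of 0.001 km² units; the function name is hypothetical.

```python
def density_per_unit_area(count, land_area_km2):
    """Persons (or housing units) per 0.001 km^2 of land area."""
    land_area_units = land_area_km2 * 1000  # number of 0.001 km^2 units (assumed reading)
    if land_area_units <= 0:
        return None
    return count / land_area_units
```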
3.3 Environmental Data
Environmental data acquired for this project include air and groundwater monitoring data
aggregated at the county level for the low-resolution model and at higher resolutions for the
Massachusetts analyses. Where data were available for only a limited number of air-monitoring
stations or drinking water samples in the region(s) being investigated, geospatial modeling
techniques might be used, as appropriate, to develop predictions across the entire
region. The existence of industrial sources of lead within each county and census tract, as indicated
by the Toxics Release Inventory (TRI), also was included as an environmental data source.
Each of these data sources is discussed in further detail below.
3.3.1 Concentrations of Lead in Air
EPA maintains a number of ongoing air monitoring programs that collect data over time on
concentrations of various criteria air pollutants, air toxics, constituents of particulate matter, and
other airborne chemicals. Each of these monitoring programs has multiple air monitoring
stations that are deployed throughout the country to meet various goals associated with the Clean
Air Act and other federal and state regulations and programs. For example, some of the
monitoring stations are placed in close proximity to industrial sources of pollution and major
population centers, while other stations are placed in remote areas to assess background
chemical concentrations. While many of these monitoring sites provide information on the
concentration of lead in air over time, a quick assessment of the spatial coverage of these
monitoring networks suggested that making use of these data would be problematic for this study
due to time and resource constraints. Lead concentrations in air from the monitoring networks
are not available in the majority of counties that will be covered in the low-resolution model, or
the census tracts that will be covered in the high-resolution models - as shown at the following
EPA Website (http://www.epa.gov/airtrends/lead.html).
Rather than using air monitoring data as described above, the study used modeled predictions of
concentrations of lead in air from EPA's 1999 National Scale Air Toxics Assessment - in which
county and census-tract-level predictions are available throughout the entire country based on the
use of predictive models. Documentation for the 1999 National Scale Air Toxics Assessment, as
well as the predicted air concentration data can be found at
http://www.epa.gov/ttn/atw/nata1999/tables.html. The predictions were generated using the
Assessment System for Population Exposure Nationwide, or ASPEN. This model is based on
the EPA's Industrial Source Complex Long Term model (ISCLT), which simulates the behavior
of the pollutants after they are emitted into the atmosphere. ASPEN uses estimates of toxic air
pollutant emissions and meteorological data from National Weather Service Stations to estimate
air toxics concentrations nationwide.
The ASPEN model takes into account important determinants of pollutant concentrations, such
as:
• rate of release
• location of release
• the height from which the pollutants are released
• wind speeds and directions from the meteorological stations nearest to the release
• breakdown of the pollutants in the atmosphere after being released (i.e., reactive decay)
• settling of pollutants out of the atmosphere (i.e., deposition)
• transformation of one pollutant into another (i.e., secondary formation).
The model estimates toxic air pollutant concentrations for every county and census tract in the
continental United States; however, these data are only available for 1999. Both the Broad-
Based National Model and the High-Resolution Model within Massachusetts considered the
integration of information from the ASPEN Model. The National Model investigated the
median, average, and 95th percentile predicted air lead concentration within each county, while
the High-Resolution Model only considered the average predicted air lead concentration within
each census tract. Within the National Model, the median, average, and 95th percentile predicted
air-lead concentrations were mostly distributed near zero. For this reason, zero/one indicator
variables were created to flag ASPEN Model predictions at or above the 95th and 99th
percentiles within the analysis dataset, for potential use
within the predictive models. In addition, EPA collaborators identified a subset of 20 counties
with observed elevated air-lead concentrations, and an indicator variable was used to assess
whether these 20 counties had higher risk of childhood lead poisoning in the predictive models.
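The zero/one percentile indicators described above can be sketched as follows. The report does not say how the percentile cutoffs were computed, so the nearest-rank definition used here is an assumption, and the function names are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank p-th percentile (0 < p <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def high_value_indicator(values, p):
    """Return 1 for each value at or above the p-th percentile, else 0."""
    cutoff = percentile_nearest_rank(values, p)
    return [1 if v >= cutoff else 0 for v in values]
```

When most values sit near zero, an indicator of this kind lets a model separate the few areas in the upper tail from the rest without relying on a linear effect across the whole range.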
The second air-lead variable investigated is based on predictions from the HAPEM5 (Hazardous
Air Pollutants Exposure Model, Version 5) model. According to the EPA website, "the
HAPEM5 model has been designed to predict the 'apparent' inhalation exposure for specified
population groups and air toxics. Through a series of calculation routines, the model makes use
of census data, human activity patterns, ambient air quality levels, climate data, and
indoor/outdoor concentration relationships to estimate an expected range of 'apparent' inhalation
exposure concentrations for groups of individuals."2 Because air quality concentrations in
indoor environments can be quite different from those in the outdoor environment, an exposure
model generally is employed to predict the apparent inhalation exposure. The Air Exposure
(HAPEM5) model variable captures the predicted exposure data from this model.
The third air-lead variable considered, Air Hazard Quotient (HQ), is derived from the 1999
National Scale Air Toxics Assessment data. This variable represents lifetime exposure for
children at the centroids of each census tract or county. Lifetime exposure is calculated based on
considering annual exposures and yearly activity patterns. The HAPEM5 and HQ air-lead
variables were only considered within the context of the High-Resolution Model within
Massachusetts.
3.3.2 Toxics Release Inventory Variables
EPA's Toxics Release Inventory (TRI) catalogs various sources of lead, based on information
provided by industrial facilities. This data source was used to generate county- and census-tract-
level estimates of the total amount of lead and/or lead-containing compounds that are released by
industrial facilities into the environment via air, surface water, or underground injection.
Although the above-described ASPEN modeling results are based on the (airborne) emissions
data and how they would theoretically translate into average ambient air-lead concentrations, the
data from the TRI are available for multiple years and for other types of emissions (such as
surface water). Thus, this information has the potential to add predictive power to the models.
Three types of TRI variables were utilized - total compounds, lead only, and total lead. Within
each type, five pollution variables were explored - total lead in the air, lead in fugitive air, lead
from smokestacks, lead in surface water, and lead in water by injection. Thus, 15 total TRI data
variables were evaluated.
2 http://epa.gov/ttn/atw/nata1999/ted/teddraft.html
Within the National Model, the distributions of the TRI emissions variables were mostly
concentrated near zero. For this reason, additional zero/one indicator variables were created to
flag TRI emissions at or above the 95th and 99th percentiles within the analysis dataset, for
potential use within the predictive models.
3.3.3 Water Quality Data
The plumbing system inside a home and the service line from the street to the home may contain
lead and can contribute to drinking water contamination. To address this potential source, EPA
obtained data from its Safe Drinking Water Information System that include the 90th
percentile result of tap water lead levels for public water systems. Public water suppliers must
monitor at customers' taps every 6 months. Public water systems can reduce monitoring to
annually, triennially, or every 9 years (if granted a monitoring waiver) if the 90th percentile value
from previous monitoring is at or below the action level of 15 parts per billion. The number of
customers' taps, or monitoring sites, that a system is required to sample is based on the
population served by the system. Further, systems are required to select sites that are most likely
to have the highest lead levels (i.e., older homes, homes with copper pipes with lead solder or
homes served by a lead service line). Therefore, the 90th percentile value of samples collected
during a monitoring period is not reflective of individual exposure to lead in drinking water.
Data available from this monitoring program include 90th percentile water lead values for public
drinking water systems serving more than 3,300 persons (systems serving fewer than 3,300
persons are required to report the 90th percentile level only if they exceed the action level), the
population size served by each facility, the start and end date for the monitoring period, and the
county in which the facility is located. These data were used to construct a population-size
weighted average 90th percentile water-lead concentration variable within each county/quarter
combination. However, it is important to note that most public water systems do not remain
within county lines. Large water systems may serve multiple counties or a county may be served
by several small public water systems.
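The population-size weighted average can be sketched as follows. The record layout and field names (`county`, `quarter`, `population`, `p90_ppb`) are hypothetical, and the sketch assumes each water system has already been assigned to a single county and quarter.

```python
from collections import defaultdict


def weighted_water_lead(systems):
    """Population-weighted mean of 90th percentile water-lead values.

    systems: iterable of dicts with the hypothetical keys
             county, quarter, population, p90_ppb.
    Returns {(county, quarter): weighted average in ppb}.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for s in systems:
        key = (s["county"], s["quarter"])
        numerator[key] += s["population"] * s["p90_ppb"]
        denominator[key] += s["population"]
    return {k: numerator[k] / denominator[k] for k in denominator if denominator[k] > 0}
```

With two systems in the same county/quarter serving 10,000 and 30,000 persons and reporting 5 and 9 ppb, the weighted average is (10,000 × 5 + 30,000 × 9) / 40,000 = 8 ppb.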
Because there were some county/quarter combinations with no observed data from EPA's Safe
Drinking Water Information System, an indicator variable was developed to indicate whether the
county/quarter included a monitored facility (or not) - allowing an intercept to be fit among
those county/quarters with no drinking water monitoring, and a slope estimate to be fit for the
effect of the weighted average 90th percentile drinking water-lead concentration among reported
facilities.
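One common way to set up such a pair of model terms is sketched below. The coding direction (1 = no monitoring data) and the placeholder value of zero are assumptions made for illustration; the report does not specify how the variables were coded.

```python
def water_lead_terms(weighted_p90):
    """Return (no_monitoring_indicator, water_lead_value) for one county/quarter.

    weighted_p90: weighted average 90th percentile water-lead value,
                  or None when no facility was monitored.
    """
    if weighted_p90 is None:
        # No data: the indicator absorbs a separate intercept, and the
        # concentration term is set to zero so it adds nothing to the fit.
        return 1, 0.0
    return 0, float(weighted_p90)
```

In a regression with both terms, the coefficient on the indicator shifts the intercept for unmonitored county/quarters, while the coefficient on the concentration term is estimated only from county/quarters with reported facilities.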
EPA's Safe Drinking Water Information System data were not geocoded to the census-tract
level, and therefore these data were only available for use in supporting the Broad-Based
National Model at this time.
3.4 Programmatic Data
Most of the explanatory variables being explored in this project are considered risk factors for
childhood lead poisoning. Among factors that might mitigate these risks, it was anticipated that
the level and characteristics of programmatic support from either federal, state, or local sponsors
may contribute toward meaningful reductions in the prevalence of childhood lead poisoning.
The level of financial support available within each county served as a proxy for programmatic
support in the low-resolution (National) models. In the high-resolution models run for
Massachusetts, information from housing inspections also was explored within the statistical
models. The following sections detail the specific characteristics of the variables used within the
models.
3.4.1 Programmatic Funding Variables
The goal of these variables is to construct a longitudinal history of current and cumulative
per-capita dollars allocated to each county and census tract to combat childhood lead poisoning.
For use in both the national and state models, data were obtained from HUD's Office of Healthy
Homes and Lead Hazard Control on grants funded since the inception of the Lead-Based Paint
Hazard Control Grant Program in 1992. Data on grant funding also were obtained from CDC's Lead
Poisoning Prevention Branch, but they arrived approximately three weeks prior to the end
of this project; because of time constraints, these data could be integrated only into the
Massachusetts models.
Four variables were generated from these data and analyzed: current and cumulative funding
allocated to each county or census tract to combat childhood lead poisoning, each in a
Standardized version (divided by the number of children in the area) and a Not Standardized
version. The Standardized variables measure funding per child, while the Not Standardized
variables measure total funding for the geographic area.
For the high-resolution model in the Commonwealth of Massachusetts, information on within-
state funding levels was obtained and analyzed. Within-state funding data were available down
to the township level. The state, HUD, and CDC funding data also were combined to create
Total Funding variables, including both current and cumulative levels and both Standardized and
Not Standardized versions. The total funding variables also were only investigated as part of the
Massachusetts analyses.
Because there may be delays in the effects of programmatic funding on risk of lead poisoning,
time-lagged versions (at 6-, 12-, 18-, 24-, 30-, and 36-months) of the programmatic funding
variables in the National Model also were investigated.
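The current, cumulative, per-child, and time-lagged funding variables can be sketched as follows. The sketch assumes funding has been tabulated by quarter for one geographic area, so that a 6-month lag is a shift of two periods; the function names and dictionary keys are hypothetical.

```python
from itertools import accumulate


def funding_variables(quarterly_dollars, n_children):
    """Current and cumulative funding, in total and per child, by quarter."""
    current = list(quarterly_dollars)
    cumulative = list(accumulate(current))
    return {
        "current": current,
        "cumulative": cumulative,
        "current_per_child": [d / n_children for d in current],
        "cumulative_per_child": [d / n_children for d in cumulative],
    }


def lag(series, months, months_per_period=3):
    """Shift a series later in time by `months`; early periods become None."""
    series = list(series)
    k = min(months // months_per_period, len(series))
    return [None] * k + series[: len(series) - k]
```

For example, `lag([1, 2, 3, 4], 6)` returns `[None, None, 1, 2]`, so that each quarter is paired with the funding level from two quarters earlier.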
3.4.2 EPA Region
The EPA region was investigated as a potential predictor of children's blood-lead levels to
determine if that high-level geographic indicator should be included as a stratification variable in
the national multivariate models.
3.4.3 Housing Inspection Data (Massachusetts)
The Commonwealth of Massachusetts maintains an extensive database on all lead-based paint
inspections conducted over time (dating back to the early 1990s). The MDPH provided a
database that contains a single record for each inspection, with the following information:
housing-unit id, census tract, date of inspection, and result of inspection (whether the housing
unit was found to be in compliance with Massachusetts standards). The database contains
records on over 200,000 housing units - with many housing units having multiple inspections
over time. Note that for units with multiple records, time periods in which the units were both in
and out of compliance with the Massachusetts standards were identified.
These data can be used in the Massachusetts high-resolution models in two ways. First, a
longitudinal summary measure of the proportion of housing within each census tract that was
known to be in compliance with the Massachusetts standards was developed. It was anticipated
that within a census tract, as this proportion increases over time, the risk of childhood lead
poisoning will decrease. Second, because individual blood-lead records from
Massachusetts with linkable housing-unit identification variables were available, a determination
could be made regarding whether a housing unit was in compliance at the time of the blood-lead
test for each child in the database (with potential outcomes of the determination being yes, no,
and unknown).
The first approach described above is consistent with the methods for exploring aggregated
summary blood-lead information over time within each census tract. The second approach
allows utilization of some predictive information at the individual child level. This information
may help improve prediction, and also may help assess what information might be lost when
transitioning from individual-level data to aggregate summary data in the analyses.
Unfortunately, due to time and resource constraints, only the first method was explored within
this project. Thus, the three measures listed below were calculated using four different methods.
The three measures are:
• P - represents the Proportion of Housing Units within a census tract that are assumed to
Meet the Massachusetts Standard of Care at any given time
• F - represents the Proportion of Housing Units within a census tract that are assumed to
Not Meet the Massachusetts Standard of Care at any given time
• N - represents the Proportion of Housing Units within a census tract with Housing
Inspection Information at any given time.
As noted, the measures were generated in four different ways, each handling the longitudinal
information in a slightly different manner. The four methods, numbered in the model results
from 1 to 4, are listed below.
1. Naive Method 1 - Create a longitudinal history for each housing unit inspected, and
treat the first inspection observation as being representative for time periods
preceding that inspection.
2. Naive Method 2 - Create a longitudinal history for each housing unit inspected, and
assume missing information for the time period preceding the first test on each unit.
3. Naive Method 3 - Create a longitudinal history for each housing unit inspected, and
treat the first inspection observation as being representative for time periods
preceding that inspection if the housing unit failed, and assume missing information
for the time period preceding the first test if the unit passed.
4. MDPH Approved Method - Create a longitudinal history for each housing unit, with
different rules for the treatment of the time-period preceding the first test based on (a)
the housing inspection result and (b) the reason for ordering the inspection.
Note that for housing units with multiple inspections, each housing inspection result is assumed
to be representative of the house (either pass or fail) until the next result. The last result is
carried forward over time (e.g., if the last observed inspection on a house passed in November of
1998 - that particular house is assumed to be meeting the Massachusetts standard of care over all
subsequent time periods in the dataset). If multiple inspections occur on the same house within a
particular quarter (3-month interval), the maximum result (with pass being coded as a 1, and fail
being coded as zero) is used to represent the house. The 0/1 results are then summed across all
observed housing units within each census tract over time (quarters). The summed results are
then divided by the number of housing units reported within each census tract from the 2000
Census.
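The construction of the longitudinal history and the three tract-level measures can be sketched as follows. For simplicity the sketch treats quarters before a unit's first inspection as missing (the rule of Naive Method 2); it does not implement the MDPH Approved Method, whose rules are not detailed here. The function names, the quarter indexing, and the reading of F as failing units divided by Census housing units are assumptions.

```python
def unit_history(inspections, n_quarters):
    """Quarterly status for one housing unit.

    inspections: list of (quarter_index, passed) with passed coded 1 = pass, 0 = fail.
    Returns a list with 1, 0, or None (no information yet) per quarter.
    """
    best = {}
    for q, passed in inspections:
        # Several inspections in one quarter: keep the maximum result.
        best[q] = max(best.get(q, 0), passed)
    history, status = [], None
    for q in range(n_quarters):
        status = best.get(q, status)  # carry the last result forward
        history.append(status)
    return history


def tract_measures(units, n_quarters, census_housing_units):
    """Return the P, F, and N series for one census tract.

    units: one inspection list per inspected housing unit in the tract.
    """
    histories = [unit_history(u, n_quarters) for u in units]
    quarters = range(n_quarters)
    p = [sum(h[q] == 1 for h in histories) / census_housing_units for q in quarters]
    f = [sum(h[q] == 0 for h in histories) / census_housing_units for q in quarters]
    n = [sum(h[q] is not None for h in histories) / census_housing_units for q in quarters]
    return p, f, n
```

In this sketch, a unit that fails and passes in quarter 1 and then fails in quarter 3 has the five-quarter history `[None, 1, 1, 0, 0]`.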
While all of the above-described housing inspection variables were investigated in the
exploratory data analyses, only the P4, F4, and N4 variables, associated with the MDPH-approved
method of constructing the longitudinal history within each housing unit observed, were
considered within the context of the multivariate models.
3.5 Data Linkages
The primary objective of this pilot study was to utilize combined information from different
sources at various levels of geographic and temporal specificity to more accurately target
geographic areas at high risk for not meeting the 2010 goal of eliminating childhood lead
poisoning. As such, work on the study required careful integration of a variety of data sources
with various characteristics and documentation. Data to support this study were gathered from a
variety of sources, including federal, state, and local lead poisoning prevention programs, as well
as publicly available data downloaded from the internet (e.g., Census data, EPA's Toxics
Release Inventory), as detailed in the previous sections.
Upon receipt of each data source, the data and supporting documentation were reviewed to
understand the structure, relationships, and quality of the data. Database managers worked
with the project team (including collaborators providing data to the project, as well as EPA) to
determine the final format for each database, desired uses of the databases, as well as the
requirements for maintaining the databases. Based on this information, separate master
databases were constructed for the national model and for the high-resolution Massachusetts
model that integrate the various environmental, demographic, and programmatic variables, and
facilitate statistical analyses of the combined data. These databases were constructed by
combining data from a variety of formats, including MS SQL Server, MS Access, Excel, ASCII,
Arc View, and SAS® electronic databases. To combine the various data sets, they were merged
on key fields, including state, county, census tract, and time period. The data
being used for analyses of a particular geographic level (e.g., county) are comparable because
they are representative of that geographic area.
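As an illustration of merging on key fields, the sketch below joins county-level records from separate sources. It is not the study's database code (which used Access, SQL Server, and SAS); the function name, field names, and values are hypothetical. Time-invariant sources are assumed to be keyed by geography only, and time-varying sources by geography and time period.

```python
def merge_on_keys(left, right, keys):
    """Left-join two lists of dict records on the given key fields."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged.append({**match, **row})  # unmatched rows keep only their own fields
    return merged


# Hypothetical county-quarter blood-lead summaries (response data).
blood_lead = [
    {"state": "25", "county": "025", "quarter": "2000Q1", "n_screened": 1200},
    {"state": "25", "county": "025", "quarter": "2000Q2", "n_screened": 1150},
]
# Hypothetical time-invariant Census covariates, keyed by state and county only.
census = [{"state": "25", "county": "025", "pct_built_pre_1950": 40.0}]
# Hypothetical time-varying programmatic data, keyed by state, county, and quarter.
funding = [{"state": "25", "county": "025", "quarter": "2000Q2", "hud_cur": 5.0}]

step1 = merge_on_keys(blood_lead, census, ("state", "county"))
analysis = merge_on_keys(step1, funding, ("state", "county", "quarter"))
```

A left join is used here so that every blood-lead record is retained even when a covariate source has no matching record, which makes gaps visible during completeness checks.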
Throughout the development process, checks for completeness were conducted on all study
databases, and the project team worked with data-sharing collaborators and EPA to attempt to
complete missing data as necessary to support the proposed statistical analyses. Any changes to
the databases (corrections, additions, deletions, etc.) were documented in appropriate metadata
files. Documentation of the combined master databases is included in Appendix H.
Standard Operating Procedures (SOPs) were followed to ensure the proper storage, backup, and
retrieval of datasets created and analyzed for this study. The various databases were backed up
to tape nightly via automated backup routines, and were only accessible to members of the
project team. CD-ROM backups were made on a regular basis to serve as a safeguard in case the
backup system failed for any reason.
Microsoft Access and SQL Server were the primary software tools used for data management.
The SAS® System was the primary statistical data analysis tool used on this project. Arc View
software was used to translate results into maps, as seen in Appendices F and G.
The data utilized for the study were maintained in a manner that preserved the confidentiality of
all the data and prevented its unauthorized release. As data files were received from EPA, the
original data (e.g., data with personal identifiers) were handled as though they were classified as
confidential business information (CBI) under the Toxic Substances Control Act (TSCA), even
though EPA may not specifically classify these data as "CBI." The data files were not shared
with anyone outside of the project team.
4.0 EXPLORATORY DATA ANALYSES
Because the goal of this study was to develop a series of statistical models that predict the risk of
childhood lead poisoning at the geographic level across multiple response variables (proportion
of children screened at or above 5, 10, 15 and 25 ug/dL), all potential predictor variables first
were explored individually to determine their predictive ability. Results from these bivariate
analyses were assessed to identify the set of variables to include in the multivariate model that
predicts how the risk of childhood lead poisoning changes over time among the various census
tracts and counties included in the analysis.
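The response variables named above (proportion of children at or above each threshold) can be illustrated with a short sketch. The function name and the example values are hypothetical; the study itself worked from quarterly summary statistics rather than from a routine like this one.

```python
THRESHOLDS = (5, 10, 15, 25)  # ug/dL thresholds named in the text


def proportions_at_or_above(blood_lead_values, thresholds=THRESHOLDS):
    """Return, for each threshold, the share of values at or above it."""
    n = len(blood_lead_values)
    if n == 0:
        raise ValueError("no screening results for this area and period")
    return {t: sum(v >= t for v in blood_lead_values) / n for t in thresholds}


# Hypothetical results (ug/dL) for one county in one quarter.
example = proportions_at_or_above([3.0, 5.0, 12.0, 26.0])
```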
This section of the report provides the results of the series of exploratory analyses described in
Section 2.2, which were performed to assess the potential predictive power of various candidate
demographic, environmental, and programmatic variables for potential use in the multivariate
models. These exploratory analyses began with an assessment of the study sample, i.e., the
proportion of counties in the sample with complete and reliable data for both the response
variable and the explanatory variables.
Each candidate predictor variable was reviewed, with particular attention to the manner
in which the county-level predictor variables would be merged with the quarterly summary
blood-lead information prior to fitting the statistical models. In preparation for developing
longitudinal statistical models, univariate summaries of each predictor variable as a function of
time were produced. Comparisons of these distributions were made using side-by-side box-plots
for continuous data or bar-charts for categorical data. This helps verify that the data are clean
and ready for analysis and helps identify cells with sparse data. Such descriptive analyses were
conducted on each predictor variable database to characterize the distributions of all observed
variables using frequency distributions for categorical variables, and simple summary statistics
(mean, median, mode, minimum, maximum, and select percentiles) for continuous variables.
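A minimal sketch of the univariate summaries described above, computed separately for each time period so that the distributions can be compared side by side. The data values and period labels are hypothetical, and only quartiles are shown for the "select percentiles."

```python
import statistics


def summarize(values):
    """Simple summary statistics for one continuous variable."""
    ordered = sorted(values)
    quartiles = statistics.quantiles(ordered, n=4)  # 25th, 50th, 75th percentiles
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p25": quartiles[0],
        "p75": quartiles[2],
    }


# Hypothetical values of one predictor, grouped by time period for side-by-side review.
by_period = {
    "1995-1999": [1, 2, 2, 3, 4, 5, 6, 7],
    "2000-2001": [2, 2, 3, 3, 3, 4, 5, 9],
}
summaries = {period: summarize(values) for period, values in by_period.items()}
```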
The univariate descriptions then were followed by fitting a series of cross-sectional bivariate
relationships between the blood-lead response variable(s) and each candidate explanatory
variable. These cross-sectional relationships were explored as a function of time to better
understand the stability of these relationships, and whether they change over time, so that they
can be modeled appropriately in the more sophisticated longitudinal analyses. These analyses
also help identify which explanatory variables are most predictive of the blood-lead response
variable.
4.1 Relationship between National Blood-Lead Data and Explanatory Variables
The response variable for the national data analysis consisted of quarterly summary statistics
from 1995-2005 on the distribution of observed blood-lead concentrations in counties across the
nation, based on information from CDC's National Childhood Lead Poisoning Surveillance
Database. The time series of summary statistics within select counties were initially investigated
to determine appropriate exclusion criteria to ensure that the data retained for analysis
represented blood-lead concentrations that were universally reported. During some periods,
certain state or local childhood lead poisoning prevention programs reported only elevated
blood-lead concentrations, and these data needed to be eliminated from the analysis.
Thus, the number of quarterly summary statistics varied from county to county within the
analysis dataset.
The national blood-lead data were categorized into four time periods: (1) January 1, 1995 to
December 31, 1999; (2) January 1, 2000 to December 31, 2001; (3) January 1, 2002 to December
31, 2003; and (4) January 1, 2004 to December 31, 2005, so that change over time could be
evaluated. These four time periods split the dataset of quarterly county-level records into
groups of roughly similar size. Presented below are the exploratory analysis results for the
demographic, environmental, and programmatic variables investigated. Detailed figures and
tables containing results are included in Appendix A. A detailed discussion of the results seen in
Appendix A is contained in Appendix D.
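The four time periods defined above can be assigned to records with a small lookup, sketched here. The boundaries are taken from the text; the function name is hypothetical.

```python
from datetime import date

# Period boundaries as stated in the text (inclusive start and end dates).
PERIODS = [
    (1, date(1995, 1, 1), date(1999, 12, 31)),
    (2, date(2000, 1, 1), date(2001, 12, 31)),
    (3, date(2002, 1, 1), date(2003, 12, 31)),
    (4, date(2004, 1, 1), date(2005, 12, 31)),
]


def period_for(day):
    """Return the time-period number (1 to 4) containing the given date."""
    for number, start, end in PERIODS:
        if start <= day <= end:
            return number
    raise ValueError(f"{day} is outside the 1995-2005 analysis window")
```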
To allow comparison of the different variables explored within each variable type, Tables 4-1
through 4-4 present the log-likelihood statistic from each single covariate model presented in
Appendix A for each of the four blood-lead threshold values, respectively. Each explanatory
variable was investigated in four different ways with respect to how the effect of the variable
might vary over time within the longitudinal analysis dataset:
1. Investigate the explanatory variable on its own, assuming that the effect remains stable
over time.
2. Investigate the explanatory variable with a linear interaction with time, assuming that the
effect of the variable on risk of childhood lead poisoning either increases or decreases
linearly over time (on the logit scale).
3. Investigate the explanatory variable with a quadratic interaction with time, assuming that
the effect of the variable on risk of childhood lead poisoning either increases or decreases
as a quadratic function in time (on the logit scale).
4. Investigate the interaction between the explanatory variable and four select time periods,
which is helpful for diagnosing whether the effect remains stable (or changes) over time -
but is not particularly useful for the final multivariate model where the application of the
model might be to forecast how risk of lead poisoning might extend into future years.
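The four specifications listed above can be sketched as different sets of covariates on the logit scale. This is one plausible parameterization, offered as an assumption: the report does not state whether time main effects accompany the interactions, which period serves as the reference level, or how time is scaled. The function names are hypothetical.

```python
import math


def design_row(x, t, period, spec, n_periods=4):
    """Covariates (after the intercept) for one county-quarter record.

    x is the explanatory variable, t is a numeric time value, and period is
    the time-period number (1 to n_periods).
    """
    if spec == "x_only":
        return [x]
    if spec == "linear":
        return [x, t, x * t]
    if spec == "quadratic":
        return [x, t, t * t, x * t, x * t * t]
    if spec == "categorical":
        # Assumed coding: period 1 is the reference level; each later period
        # gets an indicator and an indicator-by-x interaction.
        dummies = [1.0 if period == k else 0.0 for k in range(2, n_periods + 1)]
        return [x] + dummies + [x * d for d in dummies]
    raise ValueError(f"unknown spec: {spec}")


def predicted_proportion(intercept, coefs, row):
    """Inverse-logit of the linear predictor."""
    eta = intercept + sum(b * v for b, v in zip(coefs, row))
    return 1.0 / (1.0 + math.exp(-eta))
```

Under this coding, the categorical specification has no natural extension beyond the last observed period, which is consistent with the remark that it is of limited use for forecasting.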
Within each variable category, the variable that provided the best fit across the four time
variables is indicated with a double asterisk (**). For example, within the income category, the
Categorical time variable achieved the best fit for seven of the eight income variables in the
model of proportion of children with blood-lead levels above 5 ug/dL. Within that category,
Percent No Household Wage achieved the best model fit across the 8 income variables. Those
variables (indicated with the double asterisk) were the most likely to become candidate
predictors for the multivariate statistical models.
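The comparison of fits can be sketched as follows. The table footnotes state that best fit was judged after adjusting for degrees of freedom but do not name the criterion, so the Akaike information criterion (AIC) is assumed here purely for illustration. All values and parameter counts below are hypothetical and are not taken from Tables 4-1 through 4-4.

```python
def aic(neg2_log_likelihood, n_params):
    """Akaike information criterion: -2 log-likelihood plus 2 per parameter."""
    return neg2_log_likelihood + 2 * n_params


def best_fit(fits, exclude=()):
    """Return the specification with the smallest AIC, skipping any excluded ones."""
    candidates = {name: aic(*fit) for name, fit in fits.items() if name not in exclude}
    return min(candidates, key=candidates.get)


# Hypothetical (-2 log-likelihood, parameter count) pairs for one variable.
fits = {
    "x_only": (1000.0, 2),
    "linear": (990.0, 4),
    "quadratic": (989.0, 6),
    "categorical": (970.0, 8),
}
overall = best_fit(fits)  # best fit across all four time treatments
for_modeling = best_fit(fits, exclude=("categorical",))  # best fit usable for forecasting
```

In this example the categorical treatment fits best overall, but the linear treatment would be carried forward once the categorical option is set aside.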
Table 4-1. Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for
Pr(PbB >= 5 ug/dL) Models
Parameter Category
Income
Race
Housing Cost
Occupancy
Single Parent
Home Age
Children
Education
Population
Variable Name
Median_Family_Income
Median_HH_Income
Median_Per_Capita_Income
Pct_HH_No_Earnings
Pct_HH_No_Wage
Pct_LT_Poverty
Pct_Family_Income_LT_Poverty
Pct_LE_5Yrs_LT_Poverty
Pct_Asian
Pct_Black
Pct_White
Pct_NHOPI
Pct_Other_Race
Pct_Multi_Race
Pct_Hispanic
Median_Rent
Housing_Value
Pct_Rented
Pct_Vacant
Pct_Single_Parent
Median_Yr_Built
Median_Yr_Occ_Built
Pct_Built_Pre_1940
Pct_Built_Pre_1950
Pct_Built_Pre_1960
Pct_Built_Pre_1970
Pct_Built_Pre_1980
Pct_Occ_Built_Pre_1940
Pct_Occ_Built_Pre_1950
Pct_Occ_Built_Pre_1960
Pct_Occ_Built_Pre_1970
Pct_Occ_Built_Pre_1980
Pct_LE_Six
Num_LE_Six
Pct_LT_9th_Grade
Pct_No_HS_Degree
Pct_No_College
Pct_No_College_Degree
Total_Housing_Units
Total_Pop
Housing_Density
X Only
235495.9
235611.3
235579.7
235761.5
235769
235867.4
235915.1
235888.9
235862.1
235851.8
235265.4
235571.9
235913.4
235884.8
235878.1
235412.7
235268.9
235344.5
235449
235487.1
235277.5
235233.3
235458.6
235495.4
235827.7
235905.5
235850.1
235556.5
235906.3**
235908.5
235920.4
Linear
Time
235522.9
235600.3
235601.3
235567.8
235763.3
235744.8
235794.2
235681.3
235848.1
235817.2
235189.1
235562.9
235887.7
235872.2
235884
235421.5
235432.6
235277.3
235241.8
235357.7
235463.1
235503
235285.1
235245.1
235359.2
235471.3
235510.7
235829.8
235924.2
235848.4
235715.8
235509.4
235505.7
235922.2
235928.6
235925.8
Quadratic
Time
234626.6
234404.8
234323.0**
234363.2
234446.2
234373.9
234627.7
234612.0**
235259.3
235536.9
235559.1
234679.6
234046.8**
235128.2
235427.2**
234527.9**
235483.4
234214.3
233803.4
233574.5
233484.5**
233650.7
233834.3
233834
233601.1
233499.6
233666
233840.8
234448.4**
234398.6
234321.2
234237.9
234226.0**
Categorical
Time
233514.6
233439
233735.4
233260.3
233183.8*
233608.5
233751.1
234511.6
233688.1
235918.5
235250.1
234595
233502.6*
233401.0*
234857.9
234479.4*
233660.7*
234415.3
233119.1
233243.7
232990.3
232839.7*
232904.6
232946.2
233291
233030.9
232866.4
232928.4
232959.7
233449.9
227590.8*
233744
233411.6
233106.0*
228102.6*
235699.2
Parameter Category
Air Lead
TRI
Variable Name
air_avg
air_med
air_p95
air_avg_p95
air_med_p95
air_p95_p95
air_avg_p99
air_med_p99
air_p95_p99
TRI Compounds air_fug
TRI Compounds air_tot
TRI Compounds air_stk
TRI Compounds under_inj
TRI Compounds water_surf
TRI Lead Only air_fug
TRI Lead Only air_tot
TRI Lead Only air_stk
TRI Lead Only under_inj
TRI Lead Only water_surf
TRI Lead Total air_fug
TRI Lead Total air_tot
TRI Lead Total air_stk
TRI Lead Total under_inj
TRI Lead Total water_surf
tri_as1_p95
tri_as2_p95
tri_as3_p95
tri_af1_p95
tri_af2_p95
tri_af3_p95
tri_at1_p95
tri_at2_p95
tri_at3_p95
tri_ws1_p95
tri_ws2_p95
tri_ws3_p95
tri_ui1_p95
tri_ui2_p95
tri_ui3_p95
tri_as1_p99
tri_as2_p99
tri_as3_p99
tri_af1_p99
tri_af2_p99
tri_af3_p99
tri_at1_p99
tri_at2_p99
X Only
235905.9
235905
235907.6
235906.4
235907.2
235911.9
235903.5
235905.8
235922.9
235921.5
235925.3
235921.3
235927.2
235929.4
235928.7
235928.5
235926.6
235927.3
235930
235928.4
235926.7
235909
235912.6
235907.5
235914
235910.9
235911.7
235910.2
235908.1
235908
235904.1
235904.1
235904.1
235908.3
235906.6
235906.1
235907.5
235905.7
Linear
Time
235906.7
235905.5
235909.8
235909.6
235135.4**
235918.5
235906.2
235943.8
235940.6
235940.7
235942.9
235955.8
235955.2
235952.3
235950.7
235957.3
235955.9
235956.1
235917.9
235910.7
235919.2
235916.8
235915.8
235915.7
235919.1
235915.4
235904.1
235904.1
235904.1
235908.5
235914.4
235912
235912.6
235909.1
235915.6
Quadratic
Time
235815.6**
air_med_p95
236090.1
235766.3**
235684.6
235406.7
235968.5
235910.3
235312.9
235385.1
235985.8
235244.9**
235478.1
235984.3
234966.5
234262.7
235132.3
234837.1
234396.5
234716.4
233993.9
233685.3**
235370.9
233861.9
234661.9
234661.9
234661.9
235519.5
235104.8
235735.5
234680
235128.3
235711.7
Categorical
Time
235836
235921
235735.7*
233727.0*
235914.4
235244.7
235718.3
235695.8
235325.2
235199.3
235355.6
235986.4
235912.1
235155.4
235169.4
235437.5
235997.1
235942.9
234850.4*
235036
235321.1
235999.2
235933
235011.7
235125.1
233669.7
233058.4
234606.5
232572.7
232564.5*
235422
233253.8
233559.3
233559.3
233559.3
234116.8
234668.4
235364.2
234704.6
Parameter Category
Funding
Variable Name
tri_at3_p99
tri_ws1_p99
tri_ws2_p99
tri_ws3_p99
tri_ui2_p99
tri_ui3_p99
CDC_cur_lag6
CDC_cur_lag12
CDC_cur_lag18
CDC_cur_lag24
CDC_cur_lag30
CDC_cur_lag36
HUD_cur_lag6
HUD_cur_lag12
HUD_cur_lag18
HUD_cur_lag24
HUD_cur_lag30
HUD_cur_lag36
CDC_cum_lag6
CDC_cum_lag12
CDC_cum_lag18
CDC_cum_lag24
CDC_cum_lag30
CDC_cum_lag36
HUD_cum_lag6
HUD_cum_lag12
HUD_cum_lag18
HUD_cum_lag24
HUD_cum_lag30
HUD_cum_lag36
tot_cur_lag6
tot_cur_lag12
tot_cur_lag18
tot_cur_lag24
tot_cur_lag30
tot_cur_lag36
tot_cum_lag6
tot_cum_lag12
tot_cum_lag18
tot_cum_lag24
tot_cum_lag30
tot_cum_lag36
HUD_cur
HUD_cum
CDC_cur
CDC_cum
X Only
235907.7
235907
235907.6
235907.6
235904.1
235905
235427.3
235847
235897.1
235727.6
235914
235761.9
235762.7
235707.6
235672.8
235626.3
235611.1
235762.8
235859.9
235898.2
235883
235908.5
235924.3
235961.9
235952.8
235897.3
235794
235918.7
235799.2
235742.9
235657.8
235546.3
235928.9
235958
235944.2
235888.1
235774.3
252824.5
252864.5
252679.7
252586.4
Linear
Time
235913.8
235912.3
235913
235913.9
235904.1
235910.1
235415.6
235275.1
235270.6
235039.4
235246
235050.5
235833.5
235691.3
235673.8
235525.1
235487.1
235273
234603.5
234668.8
234760.3
234851.6
234933.1
234920.4
234878.9
235051
235099.6
235132.9
235148.6
235841.5
235727.4
235654.7
235439.1
235355.1
235043.1
234735
234825.3
234876.7
234937.9
252845.5
252670.1
252170.2
252213.5
Quadratic
Time
235213.9
235863.4
235667.2
235706.5
234661.9
235921.5
234816.9
234678.5
234599.8
234566.7
234702.5
235432.4
235358.3
235437.5
235434.3
235445.4
234200.7
234304.8
234472.7
234646.1
234793.8
234824.7
234725.3
234650.3
234558.3
234450.4
234334.5
235210
235169.7
235245.5
235226.8
235216.4
235054.3
234463
234380.2
234340.7
234190.6
234115.8**
252673.7
251830.5
252226.6
Categorical
Time
234780.2
235847.8
235712.8
235748.5
233559.3
235887.1
234090.5
233854.6
233781.8*
234162.9
235494.8
235381.6
235352.7
235211.7
235175.9
234912.4
233952.6
234162.5
234433.2
234466.2
234382.8
234413.4
234503.5
234499.3
235306.5
235194.9
235101.6
234897.9
234839.7
234518.3
234067.2
234123.9
234168.7
234192.1
234217.6
252876.3
252679.3
251799.9
252202.3
Parameter Category
Funding
Screening
Variable Name
Current - HUD+CDC
Cumulative - HUD+CDC
screen_penetration
X Only
235940.8
232188
Linear
Time
235788
232220.8
Quadratic
Time
235296.1
234583.8
231442.3**
Categorical
Time
235214.1
234050.2
230769.8*
** Variable factor(s) showed best fit when adjusted for degrees of freedom and were thus chosen to represent
parameter category in multivariate analysis.
* Variable factors showed best fit; however, were not included in multivariate analysis because the time
categorical variable had less than ideal prediction properties.
Table 4-2. Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for
Pr(PbB >= 10 ug/dL) Models
Parameter Category
Income
Race
Housing Cost
Occupancy
Single Parent
Home Age
Children
Variable Name
Median_Family_Income
Median_HH_Income
Median_Per_Capita_Income
Pct_HH_No_Earnings
Pct_HH_No_Wage
Pct_HH_Public_Assist
Pct_LT_Poverty
Pct_Family_Income_LT_Poverty
Pct_LE_5Yrs_LT_Poverty
Pct_Asian
Pct_Black
Pct_White
Pct_NHOPI
Pct_Other_Race
Pct_Multi_Race
Pct_Hispanic
Median_Rent
Housing_Value
Pct_Rented
Pct_Vacant
Pct_Single_Parent
Median_Yr_Built
Median_Yr_Occ_Built
Pct_Built_Pre_1940
Pct_Built_Pre_1950
Pct_Built_Pre_1960
Pct_Built_Pre_1970
Pct_Built_Pre_1980
Pct_Occ_Built_Pre_1940
Pct_Occ_Built_Pre_1950
Pct_Occ_Built_Pre_1960
Pct_Occ_Built_Pre_1970
Pct_Occ_Built_Pre_1980
Pct_LE_Six
Num_LE_Six
X Only
252403.4
252382.8
252486.2
252446.7
252363.4
252779.8
252743.7
252715.7
252839.5
252668.4
253025.3
252930.5
252775.6
252788.2
252813.9
252008.3
252392.2
253049.2
252968.6
253124
252361
252391.1
252030
252006.6
252250.6
252433.4
252481.8
252035.3
252020.3
252477.8
252508.2
252767
252881.3
Linear
Time
252372.3
252425
252444
252336.2
252806.7
252763.4
252736.5
252848.3
252569.2
253012.9
252888
252837.9
252727.4
252745.4
251790.4
252324.9
252943.3**
252990.6
253084.3
252361
252391.8
251996.7
251977.8
252241.6
252431.3
252001.6
251992
252267.3
252477.2
252502
252725.9
252835.7
Quadratic
Time
251685
251605.3
251792.5
251506.4
251373.8**
251403.6
251392.3
252369.5
252123.9
253276.4
252540.6
251886.2**
251043.0**
252139.5
251927.9**
252263.4
251439.8
251082.9
251073.7**
251254.4
251344.2
251398
251106
251103.2
251371.5
251597.1**
Categorical
Time
251406
251222.3
251439.9
251091
250935.7
250521.9
250602.3
250516
250483.1*
250374.8
251966.4
252285.1
252846.7
252403.1
248350.1*
250455.8*
251135.1*
252368.7
251596.8*
252227.6
250974.6
251206.9
250788.7
250590.1*
250672.3
250756.3
250805.4
250611.8
250713.2
250782.6
251003.9*
Parameter Category
Education
Population
Air Lead
TRI
Variable Name
Pct_LT_9th_Grade
Pct_No_HS_Degree
Pct_No_College
Pct_No_College_Degree
Total_Housing_Units
Total_Pop
Housing_Density
air_avg
air_med
air_p95
air_avg_p95
air_med_p95
air_p95_p95
air_avg_p99
air_med_p99
air_p95_p99
TRI Compounds air_fug
TRI Compounds air_tot
TRI Compounds air_stk
TRI Compounds under_inj
TRI Compounds water_surf
TRI Lead Only air_fug
TRI Lead Only air_tot
TRI Lead Only air_stk
TRI Lead Only under_inj
TRI Lead Only water_surf
TRI Lead Total air_fug
TRI Lead Total air_tot
TRI Lead Total air_stk
TRI Lead Total under_inj
TRI Lead Total water_surf
tri_as1_p95
tri_as2_p95
tri_as3_p95
tri_af1_p95
tri_af2_p95
tri_af3_p95
tri_at1_p95
tri_at2_p95
tri_at3_p95
tri_ws1_p95
tri_ws2_p95
tri_ws3_p95
tri_ui1_p95
tri_ui2_p95
tri_ui3_p95
tri_as1_p99
X Only
252734.3
252574.1
252279.4
252895.2
252890.7
252854.6
252880.2
252884.8
252876.9
252912.5
252910
252841.7
252839.7
252848.7
252867.9
252889.4
252884.9
252862.5
252864.1
252869.4
252884.3
252882.2
252866.5
252866.8
252871.7
252891.5
252890.3
252867.1
252868.4
252887.8
252922.4
252923.9
252864.3
252883.1
252886.9
252911.6
252904.6
252892.3
252840.1
252840.1
252840.1
Linear
Time
252691.3
252551.1
252216.7
252237.3
252818.7
252834.7
252838.1
252861.4
252867.1
252863.8
252878.3
252728.4**
252885
252844
252842.6
252852.6
252889.1
252901.7
252894.2
252885.6
252885.1
252894.4
252905.2
252900.4
252889.9
252892.3
252896.1
252909.9
252905.6
252889.4
252893.9
252864.5
252886.8
252903.3
252910.9
252862.8
252872.3
252855.8
252893.2
252880.2
252885.1
252895.8
252840.1
252840.1
252840.1
252856.4
Quadratic
Time
251568.3
251197.8
251090.4**
251142.9
251937
251907.9**
252786.9
252906.8
253026.4
252740.9**
253156.2
252840.5
252819.6
252832.5
252914
252880.6
252815.7
252747.2
252653.5**
252707.7
252907.5
252798.3
252702.6
252699.8
252904.5
251601.7
252833
252237.8
251998.1
251677.1
251553.2
253019.4
251407.9**
251422.9
251871.5
251871.5
251871.5
Categorical
Time
249792.4*
250216.7
250602.5
250691.1
243207.7
242397.1*
252547.2
252015.3
251966.2*
252187.9
249593.0*
252883.7
252814.5
252786.5
252314.2
252313.3
252454.4
252919.1
252884.7
252624.6
252138.6
252298.7
252931.1
252900.6
252475
252031.8*
252203.4
252929.7
252897.6
252435.7
247577.4
248106.4
252941.9
248140.5
248650.8
252210.2
247510.1*
248305.6
252996.8
248164.1
251377.4
251377.4
251377.4
Parameter Category
Funding
Variable Name
tri_as2_p99
tri_as3_p99
tri_af1_p99
tri_af2_p99
tri_af3_p99
tri_at1_p99
tri_at2_p99
tri_at3_p99
tri_ws1_p99
tri_ws2_p99
tri_ws3_p99
tri_ui1_p99
tri_ui2_p99
tri_ui3_p99
CDC_cur_lag6
CDC_cur_lag12
CDC_cur_lag18
CDC_cur_lag24
CDC_cur_lag30
CDC_cur_lag36
HUD_cur_lag6
HUD_cur_lag12
HUD_cur_lag18
HUD_cur_lag24
HUD_cur_lag30
HUD_cur_lag36
CDC_cum_lag6
CDC_cum_lag12
CDC_cum_lag18
CDC_cum_lag24
CDC_cum_lag30
CDC_cum_lag36
HUD_cum_lag6
HUD_cum_lag12
HUD_cum_lag18
HUD_cum_lag24
HUD_cum_lag30
HUD_cum_lag36
tot_cur_lag6
tot_cur_lag12
tot_cur_lag18
tot_cur_lag24
tot_cur_lag30
tot_cur_lag36
tot_cum_lag6
tot_cum_lag12
tot_cum_lag18
X Only
252867.3
252855.1
252866.8
252851.4
252844.8
252869.5
252863.2
252864.6
252853.8
252858.5
252840.1
252840.1
252842.9
252593.4
252468.6
252652.5
252698.2
252861.4
252848.9
252836
252857.7
252847.7
252821.7
252800.2
252638.8
252817.8
252840.3
252836.1
252859.8
252856.9
252856.7
252852.4
252853.1
252840.8
252848.3
252863.6
252855.8
252796.7
252863.1
252861
252858.7
Linear
Time
252864.7
252867.6
252862.5
252852.6
252850.4
252869.6
252860.2
252865.7
252857.9
252859.9
252840.1
252840.1
252851.7
252079
251799
252011.3
252155.7
252579.9
252659.6
252867.7
252839.2
252873.7
252847.4
252849.1
252769.4
252230.7
252277.9
252422.6
252479.6
252501
252607.7
252583.7
252570.4
252556.3
252857
252826.7
252859.5
252803.6
252838.7
252732.3
252551.2
252531.9
252530.3
Quadratic
Time
252768.7
252781.5
252872.8
252719
252743.1
252865.3
252780.2
252817.4
252679.5
251871.5
251871.5
252822.1
251795.6
251576.9**
251983.6
252298.4
252599.3
252674.2
252693.1
252785.4
252709.2
252762.3
252255.1
252311.3
252387.7
252451.6
252498.3
252515.7
252650.4
252639.5
252636
252612.3
252586.8
252544.6
252532.5
252615.3
252627.1
252686.2
252608.6
252677.3
252577.9
252565.2
252565.1
Categorical
Time
252011.1
251667.2
251999.2
252235.8
252035.9
252047.7
252839.7
252740.5
252756.3
251377.4
251377.4
252824.4
251723.5
251869.8
251948.3
252201.5
252883.1
252823
252856.3
252804.8
252713.8
252258.1
252319.1
252377.5
252424.6
252446.4
252640.5
252615.3
252602.2
252582.4
252559.1
252520.3
252837.6
252765.8
252788.9
252715.8
252773.5
252653.3
252560.4
252537.4
Parameter Category
Funding
Screening
Variable Name
tot_cum_lag24
tot_cum_lag30
tot_cum_lag36
HUD_cur
HUD_cum
CDC_cur
CDC_cum
Current - HUD+CDC
Cumulative - HUD+CDC
screen_penetration
X Only
252854.9
252855.1
252824.5
252864.5
252679.7
252586.4
252812.9
252861.2
243704.7
Linear
Time
252518.4
252510.3
252845.5
252670.1
252170.2
252213.5
252831.6
252590.5
243043.2**
Quadratic
Time
252551.1
252535.2
252503
252673.7
251830.5
252226.6
252573.5
252602.3
Categorical
Time
252528.4
252515.6
252491.2
252876.3
252679.3
251799.9
252202.3
252837
252591.5
243416.3
** Variable factor(s) showed best fit when adjusted for degrees of freedom and were thus chosen to represent
parameter category in multivariate analysis.
* Variable factors showed best fit; however, were not included in multivariate analysis because the time
categorical variable had less than ideal prediction properties.
Table 4-3. Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for
Pr(PbB >= 15 ug/dL) Models
Parameter Category
Income
Race
Housing Cost
Occupancy
Single Parent
Home Age
Variable Name
Median_Family_Income
Median_HH_Income
Median_Per_Capita_Income
Pct_HH_No_Earnings
Pct_HH_No_Wage
Pct_HH_Public_Assist
Pct_LT_Poverty
Pct_Family_Income_LT_Poverty
Pct_LE_5Yrs_LT_Poverty
Pct_Asian
Pct_Black
Pct_White
Pct_NHOPI
Pct_Other_Race
Pct_Multi_Race
Pct_Hispanic
Median_Rent
Housing_Value
Pct_Rented
Pct_Vacant
Pct_Single_Parent
Median_Yr_Built
Median_Yr_Occ_Built
Pct_Built_Pre_1940
Pct_Built_Pre_1950
Pct_Built_Pre_1960
X Only
292772.5
292732.9
292772.9
292673.9
293171.5
293148.6
293112.9
293290.6
293610.6
293448.5
293217.4
293165.6
293237.8
292358.3
292727.1
293905.5
293703.2
293863
292980.6
292999.8
292381.9
292404.2
Linear
Time
292744.3
292812.2
292756
292612.3
293226
293226.2
293338.3
292966.2**
293556.2
293343.9
293230
293130.1
293181.3
292076.9
292691.2
293721.5
293781.9**
292980.6
293001.5
292335.9
292364.6
292815.6
Quadratic
Time
292659.9
292590.8
292732.2
292449.8**
292889.5
292933.3
292898.4
292997.4
293176.7
293324.7
293334.4
293088.7
293035.3
293524.2
291971.3**
292732.6
293533.3**
293765
293008.9
292856.3
292195.2
292525.3
Categorical
Time
292689.4
292579.8
292766.8
292428.7
292310.0*
292588
292609.9
292547.8
292700.3
292343.9
293411.1
293311.7
293216.3
293085.7
292946.7
292091.3*
292091.4
292573.7
293491.3*
292998.6
292749.7
292161
292078.8
292400.7
Parameter Category
Home Age
Children
Education
Population
Air Lead
TRI
Variable Name
Pct_Built_Pre_1970
Pct_Built_Pre_1980
Pct_Occ_Built_Pre_1940
Pct_Occ_Built_Pre_1950
Pct_Occ_Built_Pre_1960
Pct_Occ_Built_Pre_1970
Pct_Occ_Built_Pre_1980
Pct_LE_Six
Num_LE_Six
Pct_LT_9th_Grade
Pct_No_HS_Degree
Pct_No_College
Pct_No_College_Degree
Total_Housing_Units
Total_Pop
Housing_Density
air_avg
air_med
air_p95
air_avg_p95
air_med_p95
air_p95_p95
air_med_p99
air_avg_p99
air_med_p99
air_p95_p99
TRI Compounds air_fug
TRI Compounds air_tot
TRI Compounds air_stk
TRI Compounds under_inj
TRI Compounds water_surf
TRI Lead Only air_fug
TRI Lead Only air_tot
TRI Lead Only air_stk
TRI Lead Only under_inj
TRI Lead Only water_surf
TRI Lead Total air_fug
TRI Lead Total air_tot
TRI Lead Total air_stk
TRI Lead Total under_inj
TRI Lead Total water_surf
tri_as1_p95
tri_as2_p95
tri_as3_p95
tri_af1_p95
tri_af2_p95
tri_af3_p95
X Only
293100
293086.5
292356.8
292842.2
293111.9
293214.5
293389
293199.4
292960
292557.4
292613.7
293444.1
293417.4
293273.5
293392.7
293410.8
293378.4
293522.6
293524.1
293421.4
293289.2**
293289.2**
293307
293312
293379.4
293280.9**
293302.9
293303.9
293339.1
293295.9
293361.2
293364.2
293300.6
293431.9
293447.9
293511.4
293346.9
293406.8
Linear
Time
293103.4
293089.8
292313.7
292357.3
293160.9
293281.3
293131.7
292951
292434.4
292501.1
293298.7
293297.9
293361.7
293382.5
293353.2
293462.9
293509.8**
293293.8
293293.8
293306.7
293329.5
293391.4
293373.4
293322.7
293327.8
293351.3
293317.4
293321.1
293334.7
293369.8
293370.5
293315.8
293326
293374.3
293468.1
293486.3
293341.6
293388.9
Quadratic
Time
292826.6
292851.2
292177
292139.4**
292546.1
292884
292911.5**
293584.6
293000.2
292701.2
292260.2**
292328.1
293466.2
293561.7
293267.2**
293366.1
293418.4
293316.8**
293467.7
292580.3*
293566.7
293304.7
293312.1
293304.7
293358.7
293416.4
293382.6
293325.9
293305.4
293317.6
293337.7
293331
293351.5
293355.6
293344.4
293350.3
293326.2
293184.4
293488.2
293217
Categorical
Time
292669
292684.6
292138.7
292071.1*
292418.1
292725.2
292711
292859.9
290311.1*
292491.7
292376
292152.4*
292254.9
290521.5*
293232.3
292995.5*
293112.1
292547.1*
292731.7
293262.7
293284.2
293262.7
293324
293339.6
293302.9
293193.9*
293240.5
293358.1
293362.7
293196.7
293251.3
293356.7
293369
293354.1
292325.9
293535.3
292046.2*
292195.8
Parameter Category
Funding
Variable Name
tri_at1_p95
tri_at2_p95
tri_at3_p95
tri_ws1_p95
tri_ws2_p95
tri_ws3_p95
tri_ui1_p95
tri_ui2_p95
tri_ui3_p95
tri_as1_p99
tri_as2_p99
tri_as3_p99
tri_af1_p99
tri_af2_p99
tri_af3_p99
tri_at1_p99
tri_at2_p99
tri_at3_p99
tri_ws1_p99
tri_ws2_p99
tri_ws3_p99
tri_ui1_p99
tri_ui2_p99
tri_ui3_p99
CDC_cur_lag6
CDC_cur_lag12
CDC_cur_lag18
CDC_cur_lag24
CDC_cur_lag30
CDC_cur_lag36
HUD_cur_lag6
HUD_cur_lag12
HUD_cur_lag18
HUD_cur_lag24
HUD_cur_lag30
HUD_cur_lag36
CDC_cum_lag6
CDC_cum_lag12
CDC_cum_lag18
CDC_cum_lag24
CDC_cum_lag30
CDC_cum_lag36
HUD_cum_lag6
HUD_cum_lag12
HUD_cum_lag18
HUD_cum_lag24
HUD_cum_lag30
X Only
293445.5
293490.2
293524.3
293397.5
293449.3
293260.7
293260.7
293260.7
293300.7
293352.7
293317.7
293341.5
293319.2
293295.1
293359.2
293332.6
293304.9
293308.9
293260.7
293260.7
293269.5
293205.3
293218.4
293255.1
293274.7
293290.1
293253.2
293258.5
293253.7
293238.4
293208.4
293422.1
293393.4
293368.7
293310.8
293239.4
293234.4
293243.7
293257.1
293267.1
Linear
Time
293386
293434.4
293473.4
293384.9
293388.1
293433.9
293260.7
293260.7
293260.7
293299.9
293334.1
293342.5
293329
293311.8
293296.7
293347.3
293315.4
293338.4
293260.7
293260.7
292977.5
292865.8
292878.1
292961.8
293176.8
293206.1
293284.6
293249.9
293268.9
293257.1
293251.9
293194.5
293214.4
293207.1
293212.7
293218.2
293218
293205.4
293120.8
293134.7
293145.3
Quadratic
Time
293309.2
293188.8
293159
293682.5
293081.1**
293122.9
293101.5
293101.5
293101.5
293325.5
293339.9
293328.2
293303.8
293303.8
293315.4
293329.3
293222.9
293101.5
293101.5
293293.4
292865.8
292774.0**
292799.5
292890.1
293015.4
293076.1
293128.1
293149.3
293117.4
293203.1
293106.6
293190.6
293231.5
293228.4
293218.2
293145.3
293146.5
293163.9
293164.4
Categorical
Time
292314
292172.3
292238.7
293278.9
293153
293169
293016
293101.5
293129.5
293169.3
293153.9
293345.2
293282
293282.9
292890.8
292863.8
292857.9
292900.2
293003.6
293048.7
293287.2
293242.7
293251.9
293268.9
293186.5
293238
293218.9
293214.6
293207.6
293193.2
293150.1
293150.3
293148.9
293133.9
Parameter Category
Funding
Screening
Variable Name
HUD_cum_lag36
tot_cur_lag6
tot_cur_lag12
tot_cur_lag18
tot_cur_lag24
tot_cur_lag30
tot_cur_lag36
tot_cum_lag6
tot_cum_lag12
tot_cum_lag18
tot_cum_lag24
tot_cum_lag30
tot_cum_lag36
HUD_cur
HUD_cum
CDC_cur
CDC_cum
Current - HUD+CDC
Cumulative - HUD+CDC
screen_penetration
X Only
293278.3
293267.9
293260.9
293262.4
293257.8
293245.2
293206.8
293257.7
293257.7
293267.9
293275.4
293284.2
293269.1
293240.8
293450.7
293267.6
Linear
Time
293149.3
293280.2
293241.6
293261.9
293239.6
293258.1
293107.8
293100
293114.6
293128.8
293130.5
293287.6
293156.9
293231.1
293284.9
293135.6
Quadratic
Time
293146.1
293102.1
293127.1
293160.8
293049.9
293151.9
293124.1
293123.9
293142.1
293154.5
293153.2
293142.3
293172.6
293156.8
292881.8
293246.3
286744.8**
Categorical
Time
293110.8
293268.2
293218.6
293237.1
293215.4
293252.3
293164.7
293133.9
293138.5
293141.8
293117.5
293293
293193.2
292906.9
293256.8
293170.4
** Variable factor(s) showed best fit when adjusted for degrees of freedom and were thus chosen to represent
parameter category in multivariate analysis.
* Variable factors showed best fit; however, were not included in multivariate analysis because the time
categorical variable had less than ideal prediction properties.
Table 4-4. Summary of Exploratory Analysis Fit as shown by -2 Log Likelihoods for
Pr(PbB >= 25 ug/dL) Models
Parameter Category
Income
Race
Housing
Variable Name
Median_Family_Income
Median_HH_Income
Median_Per_Capita_Income
Pct_HH_No_Earnings
Pct_HH_No_Wage
Pct_HH_Public_Assist
Pct_LT_Poverty
Pct_Family_Income_LT_Poverty
Pct_LE_5Yrs_LT_Poverty
Pct_Asian
Pct_Black
Pct_White
Pct_NHOPI
Pct_Other_Race
Pct_Multi_Race
Pct_Hispanic
Median_Rent
X Only
364225.4
364098.6
364504.4
363967.7
363920.7
364638.9
364611.6
364541.7
364863.6
364527.2
365490.4
365248.9
364698
364801.9
364789.2
363769.6
Linear Time
364192.5
364048
364426.7
363893.9**
364705.3
364741.3
364696
364929.7
364399.3**
365511.8
365038.9
364794.8
364678.7
364772.8
Quadratic
Time
364302.8
364139.1
364532.3
364071.9
363960.8
364847.1
364777.3
364856
364831.2
364703.1
363524.9**
Categorical
Time
364426
364252.5
364681.7
364002.6
363949.4
364637.4
364616.8
364527.4
364394.5*
365365.2
364770.1
364838.6
363876.3
40
-------
Parameter
Category
Cost
Occupancy
Single
Parent
Home Age
Children
Education
Population
Tri
Variable Name
Housing_Value
Pct Rented
Pct Vacant
Pct_Single_Parent
Median Yr Built
Median Yr Occ Built
Pct Built Pre 1940
Pct Built Pre 1950
Pct Built Pre 1960
Pct Built Pre 1970
Pct Built Pre 1980
Pct Occ Built Pre 1940
Pct Occ Built Pre 1950
Pct Occ Built Pre 1960
Pct Occ Built Pre 1970
Pct Occ Built Pre 1980
Pct LE Six
Num LE Six
Pct LT 9th Grade
Pct_No_HS_Degree
Pct No College
Pct No College Degree
Total_Housing_Units
Total_Pop
Housing_Density
air_avg
air_med
air_p95
air_avg_p95
air_med_p95
air_p95_p95
air_med_p99
air_avg_p99
air_p95_p99
TRI Compounds air_fug
TRI Compounds air_tot
TRI Compounds air_stk
TRI Compounds under_inj
TRI Compounds water_surf
TRI Lead Only air_fug
TRI Lead Only air_tot
TRI Lead Only air_stk
TRI Lead Only under_inj
TRI Lead Only water_surf
TRI Lead Total air_fug
TRI Lead Total air_tot
XOnly
364228.4
366381.1
365919.4
364858.6
364890.1
363573.4
363716.9
364677.4
365255
365157.7
363710.5
364710.4
365363.2
365200.7
364874.1
365297.1
364694.7
364365.8
363888.4
364172.5
365469.1
365377.3
364890.0**
365109.3
365149.3
365066.7
365506.4
365605.6**
365231.4
364910.2**
364950.2
364974.5
365029.5
365254.2
365170.3
364946
364940.3
364918.4
364950.5
364931.6
364934.6
364905.3**
Linear Time
364217.5
366003.6**
366351.7
365878.4**
364893.4
364927.1
363524.8
363694.8
364715.4
365322.4
365231.3
363502.8**
363694.8
365429.2
364636.7
363710.1**
363983.5
365172.2
364908.1
365098.9
365050.9**
365186.4
364949.7
364974.2
365019.8
365215.5
365147.7
364964.8
364958
364920.3
364975.5
364947.7
364960.9
364928.5
364996.6
Quadratic
Time
364266.9
366069.6
366647.4
365993.3
364874.1
365077.9
363636.7
363795.8
365404.2
365333.9
363615.6
363795.4
365509.7
365373.4
364787.6**
364676.4
364490.3
363897.3
364160.7
364921.6
365098.8
365142.8
365053.1
365444.5*
365205.3
364934.1
364959
364984.1
365176.2
365107.4
364980.3
364960.3
364992.6
364984.2
365013.8
Categorical
Time
364326.7
366361.5
366332.7
366022.9
364997.8
365041.9
363617.9
363754.9
364720.3
365310.9
365231.9
363591.1
364754.6
365420.4
365276
365003.5
364634.3
364359.9
363941.3
365116.6
364925.2
365022.8*
365028.1
365342.9
364912.5*
364957.4
365019.4
365259.6
365183.3
364994
365037.8
364976.7
365001.8
364995.3
365008.7
364977.9
365014.8
41
-------
Parameter
Category
Tri
Funding
Variable Name
TRI Lead Total air_stk
TRI Lead Total under_inj
TRI Lead Total water_surf
tri_as1_p95
tri_as2_p95
tri_as3_p95
tri_af1_p95
tri_af2_p95
tri_af3_p95
tri_at1_p95
tri_at2_p95
tri_ws1_p95
tri_ws2_p95
tri_ws3_p95
tri_ui1_p95
tri_ui2_p95
tri_ui3_p95
tri_as1_p99
tri_as2_p99
tri_as3_p99
tri_af1_p99
tri_af2_p99
tri_af3_p99
tri_at1_p99
tri_at2_p99
tri_at3_p99
tri_ws1_p99
tri_ws2_p99
tri_ws3_p99
tri_ui1_p99
tri_ui2_p99
tri_ui3_p99
CDC_cur_lag6
CDC_cur_lag12
CDC_cur_lag18
CDC_cur_lag24
CDC_cur_lag30
CDC_cur_lag36
HUD_cur_lag6
HUD_cur_lag12
HUD_cur_lag18
HUD_cur_lag24
HUD_cur_lag30
HUD_cur_lag36
CDC_cum_lag6
CDC_cum_lag12
CDC_cum_lag18
XOnly
365012.9
364926.7
364946.9
365325.7
365559.3
365113.2
365283.2
365406.5
365435.5
365307.8
365239.4
365338.3
364871.7
364871.7
364871.7
364965.4
365027.6
365079.1
365068
365053.7
365022.6
365048.8
365034.9
364999.7
364994.9
364991.4
364871.7
364871.7
365029.5
365008.6
364967
364989
364937.9
364893.3
364916
364850.4
364824.3
364796.1
365040.7
365006.7
364981.8
Linear Time
365029.9
364942.8
364973.4
365249.4
365223.1
365489.7
365498
365113.5
365267
365320.9
365358.5
365230.3
365316.1
364871.7
364871.7
364871.7
364961.3
365021.2
365059.8
365051.1
365042.3
365006.5
365036.8
365011.9
364972.1
364977.9
364981.6
364871.7
364871.7
364867.2
364956.9
364858.7
364840.3
364897
364919.4
364894.9
364927.7
364889.1
364869.8
364863.1
364841.4
364782.8
365009.7
365000.8
Quadratic
Time
365049.1
364980
365004.6
365249.7
365193.4
365459.1
365506.5
365225.7
365306.8
365319.4
365270.9
365195
365285.3
365011.6
365011.6
365011.6
364961.2
365024.1
365067.5
365038.6
365010.2
365050
365019.8
364983
364985.8
364976.5
364951.1
365011.6
365011.6
364864.5**
364956.9
364871.8
364881.1
364960.1
364963.5
364991.3
364808.6
364804.6
364768.1**
364824
364789.4
364802.3
365084.8
365059.1
365047.7
Categorical
Time
365054.7
364990.4
365022.5
365152.7
365423.2
365620
365005.7
365145.6
365423.7
365270.2
365266.6
364990
364990
364990
364964.3
364974.7
365020.9
364985.1
364974
364994.8
365005.5
364990.4
364993.6
364990
364990
364857.2*
364945.6
364857.2
364852
364906.9
364904.2
364920
364900.9
364891.8
364768.8*
365040
365027.3
365021.5
42
-------
Parameter
Category
Funding
Screening
Variable Name
CDC_cum_lag24
CDC_cum_lag30
CDC_cum_lag36
HUD_cum_lag6
HUD_cum_lag12
HUD_cum_lag18
HUD_cum_lag24
HUD_cum_lag30
HUD_cum_lag36
tot_cur_lag6
tot_cur_lag12
tot_cur_lag18
tot_cur_lag24
tot_cur_lag30
tot_cur_lag36
tot_cum_lag6
tot_cum_lag12
tot_cum_lag18
tot_cum_lag24
tot_cum_lag30
tot_cum_lag36
HUD_cur
HUD_cum
CDC_cur
CDC_cum
Current -HUD+CDC
Cumulative - HUD+CDC
screen_penetration
XOnly
364957
364934.4
364920.3
364860.6
364850.7
364851.8
364867.1
364877.2
364892.6
364930.2
364862.5
364862
364830.9
364793.7
364890.4
364876.1
364873.5
364884.2
364889.6
364902.3
364930.7
364883.2
365027.8
365081.4
364946.5
360376
Linear Time
364993.3
364984.2
364978
364793.5
364788
364791.1
364807.6
364826.6
364941.5
364872.6
364873
364848.9
364783.3
364819.5
364814
364819
364833.1
364948
364815.7
364962.7
365028.2
364966.1
364843
360256.7**
Quadratic
Time
365033.1
365017.3
365004.7
364808.3
364807.5
364810.2
364816.1
364803.6
364809.5
364842.7
364840.7
364795.3
364853.4
364794.7
364816.2
364839.4
364833.9
364837.1
364846
364839.5
364849.2
364886.1
364969.9
365131.3
364919.2
364865.1
Categorical
Time
365011.9
364816.4
364794.2
364790.5
364772.6
364775.8
364926.8
364900
364887.3
364868.3
364776.2
364845.2
364836.1
364830.7
364832.5
364821.7
364829.3
364936.8
364835.9
364980.6
365060.7
364864.6
360321.8
** Variable factor(s) showed best fit when adjusted for degrees of freedom and were thus chosen to represent
parameter category in multivariate analysis.
* Variable factors showed best fit; however, were not included in multivariate analysis because the time
categorical variable had less than ideal prediction properties.
4.2 Relationship between Local Blood-Lead Data and Explanatory Variables
Many of the variables investigated for the National (Low Resolution) model also were explored
for the local modeling using Massachusetts data. All of the census data were used in both
models, although at the census-tract level rather than at the county level. The various
demographic, environmental, and programmatic variables were explored using the same
techniques as the national data, which were described in Section 2.2. Detailed figures and tables
containing exploratory results are included in Appendix B. A detailed discussion of the results
seen in Appendix B is contained in Appendix E. Table 4-5 presents the log-likelihood statistics
that resulted from the bivariate modeling. Variables presenting the best model fit within each
variable category are highlighted in yellow.
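The selection rule described above (keep the best-fitting variable within each category) can be sketched as follows. This is only an illustration: it assumes the tabulated statistics behave like -2 log-likelihoods, so that a smaller value means a better fit, and the helper name `best_variable` is hypothetical. The five values are the first Income rows of the Model 1 column of Table 4-5 as extracted; whether the result matches the variable highlighted in the original table cannot be confirmed from this text.

```python
# Fit statistics copied from the Model 1 column of Table 4-5 (first five
# Income rows). Assumption: smaller values indicate a better fit.
income_fit = {
    "Median Family Income ($)": 51727.2,
    "Median Household Income ($)": 51689.2,
    "Median Per Capita Income ($)": 51917.2,
    "Percent No Household Earnings": 52020.1,
    "Percent No Household Wage": 52039.6,
}


def best_variable(fit_by_variable):
    """Return the (variable, statistic) pair with the smallest statistic."""
    return min(fit_by_variable.items(), key=lambda item: item[1])


best_name, best_value = best_variable(income_fit)
print(best_name, best_value)
```

The same function would be applied once per variable category and once per model column.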
43
-------
Table 4-5. Summary of Log-likelihood Ratios from each Model Fit to all Potential Explanatory Variables, Massachusetts Data
Variable Category
Income
Race
Housing Costs
Occupancy
Single Parent
Home Age
Variable
Median Family Income ($)
Median Household Income ($)
Median Per Capita Income ($)
Percent No Household Earnings
Percent No Household Wage
Percent Household on Public Assistance
Percent Below Poverty Line
Percent Family Income Below Poverty Line
Percent Less than 5 Years in Poverty
Percent Amer. Indian and Alaskan Native Alone
Percent Asian Alone
Percent Black Alone
Percent White Alone
Percent Native Hawaiian and Other Pacific Islander Alone
Percent Other Race Alone
Percent Multiple Races
Percent Hispanic
Median Rent ($)
Housing Value ($)
Percent Rented
Percent Vacant
Percent Single Parent
Year Built
Year Occupied Unit Built
Percent Built Before 1940
Percent Built Before 1950
Percent Built Before 1960
Percent Built Before 1970
Percent Built Before 1980
Percent Occupied Units Built Before 1940
Percent Occupied Units Built Before 1950
Percent Occupied Units Built Before 1960
Model 1
51727.2
51689.2
51917.2
52020.1
52039.6
51963.8
51974.1
51991.3
52025.2
52259.2
52264.2
52170.1
52123.5
52273.4
52202.6
52042.2
52181.3
52004.0
52094.5
52006.7
52186.5
51747.7
51949.7
51966.8
51923.9
51897.9
51959.8
52052.3
52073.4
51931.0
51910.8
51975.9
Model 2
48154.7
48114.6
48345.5
48445.8
48467.3
48385.7
48389.9
48412.1
48446.0
48692.5
48699.0
48599.1
48547.4
48706.6
48633.4
48470.6
48609.2
48440.1
48520.9
48421.1
48617.5
48155.6
48360.5
48377.9
48335.3
48308.1
48374.0
48469.4
48490.9
48342.4
48321.2
48389.9
Model 3
86501.8
86375.5
86812.3
86853.4
86858.3
86854.4
86778.4
86847.5
86924.8
87120.6
87139.8
87051.0
86960.9
87135.6
87055.1
86889.1
87019.6
86942.8
87032.6
86681.5
87003.7
86542.8
86621.1
86641.4
86505.1
86483.1
86653.5
86806.3
86850.4
86512.2
86497.8
86675.0
Model 4
139627.0
139433.1
140036.2
139531.9
139459.8
139836.6
139558.4
139686.4
139739.5
139781.5
139784.2
139740.3
139688.4
139661.3
139808.6
139972.2
140053.9
139426.7
139451.9
139654.1
139739.2
139748.6
139476.8
139547.1
139697.8
139775.8
139771.3
139475.5
139549.3
139713.0
Model 5
178071.4
177919.9
178467.2
177346.7
177184.8
178218.1
177838.6
177952.4
178034.4
177392.1
177384.8
177691.4
178066.6
177378.7
177606.3
178104.6
177754.4
177952.2
178202.0
177818.5
177021.2
178338.5
178275.2
178258.9
178110.9
178188.5
178061.5
177977.1
177896.0
178078.2
178149.3
178044.5
44
-------
Variable Category
Children
Education
Population
Air
HUD Funding
Variable
Percent Occupied Units Built Before 1970
Percent Occupied Units Built Before 1980
Percent Less than 6 Years of Age
Number Less than 6 Years of Age
Percent Less than 9th Grade
Percent without High School Degree
Percent without any College
Percent without College Degree
Total Housing Units
Total Population
Housing Density
Air Dispersion (ASPEN) Model
Air Exposure (HAPEM5) Model
Air Hazard Quotient (HQ)
Current HUD Funding ($ per Child)
Cumulative HUD Funding ($ per Child)
Current State Funding ($ per Child)
Cumulative State Funding ($ per Child)
Current CDC Funding ($ per Child)
Cumulative CDC Funding ($ per Child)
Current Total Funding ($ per Child)
Cumulative Total Funding ($ per Child)
Current HUD Funding ($ per Census Tract)
Cumulative HUD Funding ($ per Census Tract)
Current State Funding ($ per Census Tract)
Cumulative State Funding ($ per Census Tract)
Current CDC Funding ($ per Census Tract)
Cumulative CDC Funding ($ per Census Tract)
Current Total Funding ($ per Census Tract)
Cumulative Total Funding ($ per Census Tract)
Model 1
52070.4
52083.6
52280.4
52243.2
52006.5
51852.7
51787.2
51822.1
52286.9
52223.0
52272.6
52275.1
52273.1
52272.3
52290.9
52287.3
52162.4
52200.7
52288.3
52292.6
52282.8
52292.2
52302.9
52306.8
52232.2
52210.6
52297.5
52303.7
52303.1
52300.1
Model 2
48486.2
48501.0
48713.7
48678.2
48435.7
48279.9
48218.2
48251.2
48720.8
48660.6
48697.2
48708.3
48706.3
48705.5
48722.4
48723.3
48582.7
48617.7
48720.8
48725.0
48711.9
48721.9
48737.1
48740.4
48659.5
48638.3
48729.5
48737.2
48735.6
48732.3
Model 3
86829.5
86860.0
87126.9
86913.2
86828.7
86709.8
86729.0
86759.6
87090.7
86909.7
87068.2
87136.9
87134.9
87134.0
87140.4
87163.4
87014.3
87044.0
87151.7
87136.0
87145.9
87167.5
87097.6
87178.1
87222.7
87162.7
87121.6
87173.9
87179.7
Model 4
139799.3
139776.7
138995.8
139726.7
139857.7
140104.0
140084.9
139461.7
138948.0
139579.8
139740.1
139737.8
139737.0
139755.6
139804.9
139706.6
139740.2
139760.5
139706.1
139723.5
140076.2
140117.0
139570.8
139790.6
139780.5
Model 5
177979.1
177867.1
177615.0
177859.2
178346.7
178689.0
178625.6
176701.5
177260.2
177377.1
177375.0
177374.2
177426.7
177444.3
177503.7
177459.9
177456.0
177330.9
177377.3
177172.3
177127.9
178005.1
178018.7
177462.6
176972.0
177370.8
177447.9
45
-------
Variable Category
TRI
Housing Inspection
Variable
TRI Compounds (Total Air)
TRI Compounds (Fugitive Air)
TRI Compounds (Stacks)
TRI Compounds (Water Surface)
TRI Lead Only (Total Air)
TRI Lead Only (Fugitive Air)
TRI Lead Only (Stacks)
TRI Lead Only (Water Surface)
TRI Total Lead (Total Air)
TRI Total Lead (Fugitive Air)
TRI Total Lead (Stacks)
TRI Total Lead (Water Surface)
P1: Proportion of Housing Units Passing MA Standard of Care: Naive Method 1
F1: Proportion of Housing Units Failing MA Standard of Care: Naive Method 1
N1: Proportion of Housing Units Assessed: Naive Method 1
P2: Proportion of Housing Units Passing MA Standard of Care: Naive Method 2
F2: Proportion of Housing Units Failing MA Standard of Care: Naive Method 2
N2: Proportion of Housing Units Assessed: Naive Method 2
P3: Proportion of Housing Units Passing MA Standard of Care: Naive Method 3
F3: Proportion of Housing Units Failing MA Standard of Care: Naive Method 3
N3: Proportion of Housing Units Assessed: Naive Method 3
P4: Proportion of Housing Units Passing MA Standard of Care: MDPH Method
F4: Proportion of Housing Units Failing MA Standard of Care: MDPH Method
N4: Proportion of Housing Units Assessed: MDPH Method
Model 1
52295.5
52287.7
52295.6
52285.5
52291.0
52289.6
52291.7
52274.4
52295.8
52291.5
52295.6
52285.4
52214.0
52108.5
52131.7
52240.8
52199.5
52208.4
52240.8
52108.5
52160.8
52240.5
52106.5
52160.3
Model 2
48728.8
48720.7
48728.9
48718.8
48724.0
48722.6
48725.0
48708.5
48729.0
48724.6
48728.9
48718.6
48643.0
48548.2
48554.8
48671.5
48645.5
48641.6
48671.5
48548.2
48586.2
48671.1
48545.8
48585.5
Model 3
87153.5
87146.1
87154.3
87147.0
87153.1
87151.4
87153.7
87138.1
87155.2
87154.0
87154.3
87147.4
87070.2
86916.6
86963.0
87101.3
87046.4
87066.2
87101.3
86916.6
86996.1
87098.8
86919.8
86994.8
Model 4
139761.4
139750.0
139761.8
139753.0
139762.9
139759.1
139757.2
139746.5
139759.7
139761.5
139760.6
139753.3
139774.6
139803.2
139849.2
139767.8
139861.5
139826.1
139767.8
139803.2
139865.4
139769.1
139808.9
Model 5
177395.8
177399.8
177395.5
177391.9
177405.1
177404.2
177390.5
177386.0
177392.4
177401.9
177393.4
177393.0
177816.1
178520.9
178224.8
177800.8
178375.2
178110.9
177800.8
178520.9
178257.4
177809.5
178518.4
178259.8
46
-------
5.0 STATISTICAL MODELING RESULTS
As described in Section 2.3, for each statistical model within each of the two broad model types
(Low and High Resolution) the variables that led to the best model fits were initially included in
a multivariate statistical model and assessed jointly to determine which variables were predictive
of children's blood-lead levels. If higher order interactions with time were not significant within
the multivariate model and did not negatively impact the fit of the model upon removal, they
were subsequently removed. As results of each model run were reviewed, some variables were
dropped from the model if they were not significant predictors of the outcome variable and were
not improving the fit of the model by being included. Thus, each model was run and results were
assessed multiple times until a final model was reached. The sections below present the final
model results for the national risk models (Section 5.1) and the local risk models for
Massachusetts (Section 5.2). Maps of the predicted results are included in Section 6 and in
Appendix G.
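The iterative review described above resembles a backward-elimination loop. The sketch below is a hedged illustration, not the procedure actually used for this report: the callback `fit`, the 0.05 cutoff, and the rule "drop a term only if the fit criterion does not get worse" are all assumptions introduced for the example, and `toy_fit` is a made-up stand-in for a real model fit.

```python
def backward_eliminate(terms, fit, alpha=0.05):
    """Drop terms one at a time while they are non-significant and not needed for fit.

    `fit(terms)` is a hypothetical callback returning (p_values, criterion),
    where p_values maps each term to its p-value and a smaller criterion
    means a better fit.
    """
    current = list(terms)
    while True:
        p_values, criterion = fit(current)
        # Candidates for removal, least significant first.
        candidates = sorted(
            (t for t in current if p_values[t] > alpha),
            key=lambda t: p_values[t],
            reverse=True,
        )
        for term in candidates:
            reduced = [t for t in current if t != term]
            _, reduced_criterion = fit(reduced)
            if reduced_criterion <= criterion:
                current = reduced
                break  # refit and reassess the smaller model
        else:
            # No candidate could be removed without hurting the fit.
            return current


def toy_fit(terms):
    """Illustrative stand-in: fixed p-values, criterion equal to model size."""
    p = {"a": 0.001, "b": 0.40, "c": 0.20}
    return {t: p[t] for t in terms}, float(len(terms))


final_terms = backward_eliminate(["a", "b", "c"], toy_fit)
```

In practice `fit` would wrap whatever mixed-model software is in use, and the criterion would be a penalized likelihood or similar measure.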
5.1 Low-Resolution Modeling Results
Table 5-1 presents the full set of variables included in the final multivariate models for Models 1
through 4. Across all four models, the time and space variables were important predictors of the
various outcomes. The same three variables related to time and space were included in all four
models:
• EPA region
• the interaction between EPA region and a continuous measure of time (in years,
centered at the year 2000)
• the interaction of EPA region and quarter of the year with the 3rd quarter (July-
September) associated with the highest predicted lead levels.
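As a rough sketch of how a covariate enters with its time interactions, the function below builds the X, X*Time, and X*Time^2 terms for one county/quarter record. The report states only that time is continuous, in years, and centered at 2000; placing each record at the midpoint of its quarter is an assumption made here for illustration, and the function name `time_terms` is hypothetical.

```python
def time_terms(year, quarter, x):
    """Build the X, X*Time, and X*Time^2 terms for one county/quarter record.

    Assumption: time is the calendar year plus the quarter midpoint, minus
    2000. The report only says time is in years, centered at the year 2000.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    t = (year - 2000) + (quarter - 0.5) / 4.0
    return {"time": t, "x": x, "x_time": x * t, "x_time2": x * t * t}


# Third quarter of 2001 with a covariate value of 0.2.
terms = time_terms(2001, 3, 0.2)
```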
Notes on the other variable types explored and a summarization of the set of variables included
in the final models are presented below.
• Income - Percent of Units with No Household Wages was included in Models 1 to 3 with
all interactions with time included. Percent of Units with No Household Earnings was
included in Model 4, although the interaction with time squared was dropped.
• Race - Percent Black was included in Model 1, Percent Multiple Races in Model 2, and
Percent Asian in Models 3 and 4, although the interaction with time squared was dropped
in Model 3 and both interactions were dropped in Model 4. The best-fitting race
variables were included in Models 1 through 5.
• Housing Cost - Median Rent was included in all models, and all interaction terms
appeared to be strong predictors.
• Occupancy - Percent Vacant was included in Model 1 and Percent Rented in the other
three. The interaction with time squared was dropped in Models 2 and 4.
• Single Parent Status - The percent of single parent households was included in all
models, although the interaction with time squared was dropped in Models 3 and 4.
• Housing Age - Percent Built Pre-1960 was included in Models 1 and 2, Percent Built Pre-
1950 in Model 3, and Percent Built Pre-1940 in Model 4.
47
-------
• Children's Age - The percent of children less than six years old was included in all
models, although the p-values for each term in Model 2 are high.
• Education Level - Percent Without a College Degree was included in all models,
although the interaction with time squared was dropped in Model 3 and both interaction
terms were dropped in Model 4.
• Population - Total Housing Units was included in Model 1 without either interaction
with time. Total Population was included in Model 2 with both interactions. Housing
Density was included in Models 3 and 4, although the interaction terms were dropped
from Model 4.
• Air Lead - Median Air Lead, 99th Percentile was included in Models 1, 3, and 4, although
the interaction terms were dropped from Models 3 and 4. Air Lead, 95th Percentile was
included in Model 2.
• TRI - TRI Lead Total Air, 95th Percentile was included in Model 1 with all interaction
terms. TRI Lead Water Surface, 95th Percentile was included in Models 2 and 3 with both
interaction terms. TRI Lead Underground Injection, 95th Percentile was included in Model
4, but the interaction with time squared was dropped.
• Drinking Water - The two Mean Water Lead Concentration variables were included in
each model, although the interaction with time squared was dropped in Model 3 and both
interactions were dropped in Model 4.
• Funding - Total Cumulative Funding 36-month Time Lag was included in Model 1 with
both interactions with time. Current CDC Funding 12-month Time Lag was included in
Models 2 and 3 with both interaction terms. Current HUD Funding 12-month Time Lag
was included in Model 4 with all terms being significant.
• Screening - Screening penetration was included in each model, although the interaction
with time squared was dropped in Models 2 and 4.
Thus, in Model 4 for probability of blood-lead level > 25 ug/dL, most of the interactions with time
squared were dropped from the model and a number of the interactions with time were dropped
as well. Note that when the interaction with time and/or time squared was significant or
improved the model, the lower order terms were kept in the model even if a particular term had a
p-value above 0.05.
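The rule for keeping lower order terms can be expressed compactly. In this minimal sketch, each term is a (variable, order) pair, with order 0 for the main effect X, 1 for X*Time, and 2 for X*Time^2; this encoding and the function name `enforce_hierarchy` are assumptions made for illustration only.

```python
def enforce_hierarchy(kept_terms):
    """Add any lower-order terms implied by retained interactions.

    Terms are (variable, order) pairs, where order 0 is the main effect X,
    1 is X*Time, and 2 is X*Time^2.
    """
    result = set(kept_terms)
    for variable, order in kept_terms:
        for lower in range(order):
            result.add((variable, lower))
    return result


# Keeping only the time-squared interaction pulls in X and X*Time as well.
model_terms = enforce_hierarchy({("Median Rent", 2)})
```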
Tables 5-2 through 5-7 present the parameter estimates from each of the four multivariate
national models. The standard error and p-value associated with each predictor also is included.
Estimates also are presented for the three variance components that were included in the national
models -
-------
The second set of figures plots the observed values versus the predicted values for each model. If
the fitted multivariate model is appropriate, the predicted values, when plotted against the
observed values, should fall very close to the 45° line. Figures 5-2, 5-4, 5-6, and
5-8 contain these comparison plots for each of the four national models, respectively. The plots
were constructed on both the observed probability and logit probability scales, with observed data
points at zero and one censored in the logit scale plots.
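The logit-scale comparison can be illustrated with a short sketch. It assumes that "censored" means the points with an observed proportion of exactly zero or one are left out of the logit plot, since the logit is undefined there, and that predicted proportions lie strictly between zero and one. The function name `logit_pairs` is hypothetical.

```python
import math


def logit_pairs(observed, predicted):
    """Return (logit(observed), logit(predicted)) pairs, skipping observed 0 or 1."""
    pairs = []
    for obs, pred in zip(observed, predicted):
        if obs <= 0.0 or obs >= 1.0:
            continue  # logit is undefined at 0 and 1, so drop the point
        pairs.append(
            (math.log(obs / (1.0 - obs)), math.log(pred / (1.0 - pred)))
        )
    return pairs


# Only the middle record survives; the other two have observed 0 and 1.
pairs = logit_pairs([0.0, 0.5, 1.0], [0.1, 0.5, 0.9])
```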
In general, these plots suggest that the models are performing well. A weighted regression line
(blue line) fit to the observed versus predicted plots shows a very high R² value in most of the
models and mirrors the 45° line (shown in red) for the majority of the data. One important trend
in these plots is that the Broad-Based National Models tend to
under-predict for county/quarter combinations with higher proportions that exceed the 5, 10, 15,
and 25 ug/dL threshold values. Further exploration may be necessary to determine whether these
higher values represent county/quarter combinations with fairly sparse data (i.e., few
observations), which would explain why they were less influential, because the
models are influenced by the number of observations associated with each observed value. For
the higher blood-lead threshold categories, the model appears to over-predict the lower observed
proportions, suggesting the possibility of a regression to the mean effect.
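An R² of the kind reported on these plots can be computed as sketched below. The weighting scheme is an assumption: the example weights each county/quarter point by its number of observations, which the text suggests but does not state explicitly. For a straight-line fit, R² equals the squared weighted correlation, which is what the hypothetical function `weighted_r2` returns.

```python
def weighted_r2(observed, predicted, weights):
    """R^2 from a weighted least-squares line of predicted on observed."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, observed)) / total
    mean_y = sum(w * y for w, y in zip(weights, predicted)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, observed))
    syy = sum(w * (y - mean_y) ** 2 for w, y in zip(weights, predicted))
    sxy = sum(
        w * (x - mean_x) * (y - mean_y)
        for w, x, y in zip(weights, observed, predicted)
    )
    if sxx == 0 or syy == 0:
        raise ValueError("observed and predicted must both vary")
    # For simple linear regression, R^2 is the squared weighted correlation.
    return (sxy * sxy) / (sxx * syy)


# Illustrative values only: predictions that lie exactly on a straight line.
r2 = weighted_r2([0.1, 0.2, 0.4], [0.15, 0.25, 0.45], [10, 20, 5])
```

A high R² alone does not show agreement with the 45° line; the slope and intercept of the fitted line also need to be close to one and zero.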
49
-------
Table 5-1. Summary of Variables Included in Final National Multivariate Model
Variable Type | Model 1 | Model 2 | Model 3 | Model 4
Area and Time | Region | Region | Region | Region
Area and Time | Time*Region | Time*Region | Time*Region | Time*Region
Area and Time | Region*Quarter | Region*Quarter | Region*Quarter | Region*Quarter
Income | Percent of Units No HH Wages | Percent of Units No HH Wages | Percent of Units No HH Wages | Percent of Units No HH Earnings
Race | Percent Black | Percent Multiple Races | Percent Asian | Percent Asian
Housing Cost | Median Rent | Median Rent | Median Rent | Median Rent
Occupancy | Percent Vacant | Percent Rented | Percent Rented | Percent Rented
Family Structure | Percent Single Parent | Percent Single Parent | Percent Single Parent | Percent Single Parent
Housing Age | Percent Built Pre-1960 | Percent Built Pre-1960 | Percent Occupied Built Pre-1950 | Percent Occupied Built Pre-1940
Children's Age | Percent < Six Years Old | Percent < Six Years Old | Percent < Six Years Old | Percent < Six Years Old
Education | Percent without College Degree | Percent No College | Percent No College | Percent No College
Population | Total Housing Units | Total Population | Housing Density | Housing Density
Air Lead | Median Air Lead, 99th percentile | Air Lead, 95th percentile | Median Air, 99th percentile | Median Air, 99th percentile
TRI | TRI Lead Total Air, 95th percentile | TRI Lead Water Surface, 95th percentile | TRI Lead Water Surface, 95th percentile | TRI Lead UI, 95th percentile
Drinking Water Lead | Mean Water Lead (water=1) | Mean Water Lead (water=1) | Mean Water Lead (water=1) | Mean Water Lead (water=1)
Drinking Water Lead | Mean Water Lead (water=2) | Mean Water Lead (water=2) | Mean Water Lead (water=2) | Mean Water Lead (water=2)
Funding | Total Cumulative Funding, 36-month Time Lag | Current CDC Funding, 12-month Time Lag | Current CDC Funding, 12-month Time Lag | Current HUD Funding, 12-month Time Lag
Screening | Screening Penetration | Screening Penetration | Screening Penetration | Screening Penetration
50
-------
Table 5-2. Model 1 (Proportion > 5 ug/dL) Parameter Estimates for Multivariate National Model
Region 1
Effect | Estimate | StdErr | P-Value
Region | -2.200 | 0.291 | [illegible]
Time*Region | 0.115 | 0.060 | 0.0546
Region*Quarter-1 | -0.149 | 0.008 | [illegible]
Region*Quarter-2 | -0.054 | 0.008 | [illegible]
Region*Quarter-3 | 0.243 | 0.007 | [illegible]
Region*Quarter-4 | 0.000 | |
Region 2
Effect | Estimate | StdErr | P-Value
Region | -2.567 | 0.299 | [illegible]
Time*Region | 0.114 | 0.061 | [illegible]
Region*Quarter-1 | -0.256 | 0.005 | [illegible]
Region*Quarter-2 | -0.142 | 0.005 | [illegible]
Region*Quarter-3 | 0.209 | 0.005 | [illegible]
Region*Quarter-4 | 0.000 | |
Region 3
[illegible]
Region 4
Effect | Estimate | StdErr | P-Value
Region | -2.003 | 0.280 | [illegible]
Time*Region | -0.065 | 0.058 | 0.2628
Region*Quarter-1 | -0.195 | 0.006 | [illegible]
Region*Quarter-2 | -0.038 | 0.006 | [illegible]
Region*Quarter-3 | 0.068 | 0.006 | [illegible]
Region*Quarter-4 | 0.000 | |
Region 5
Effect | Estimate | StdErr | P-Value
Region | -2.546 | 0.284 | [illegible]
Time*Region | 0.018 | 0.058 | 0.758996
Region*Quarter-1 | -0.334 | 0.004 | [illegible]
Region*Quarter-2 | -0.117 | 0.004 | [illegible]
Region*Quarter-3 | 0.280 | 0.004 | [illegible]
Region*Quarter-4 | 0.000 | |
Region 6
Effect | Estimate | StdErr | P-Value
Region | -2.196 | 0.283 | [illegible]
Time*Region | 0.016 | 0.058 | 0.7801
Region*Quarter-1 | -0.114 | 0.008 | [illegible]
Region*Quarter-2 | -0.018 | 0.008 | [illegible]
Region*Quarter-3 | 0.102 | 0.008 | [illegible]
Region*Quarter-4 | 0.000 | |
Region 7
Effect | Estimate | StdErr | P-Value
Region | -2.406 | 0.279 | [illegible]
Time*Region | 0.048 | 0.057 | 0.4064
Region*Quarter-1 | -0.291 | 0.009 | [illegible]
Region*Quarter-2 | -0.006 | 0.009 | 0.5220
Region*Quarter-3 | 0.184 | 0.008 | [illegible]
Region*Quarter-4 | 0.000 | |
Region 8
Effect | Estimate | StdErr | P-Value
Region | -2.621 | 0.285 | [illegible]
Time*Region | 0.076 | 0.059 | 0.2013
Region*Quarter-1 | -0.277 | 0.057 | [illegible]
Region*Quarter-2 | -0.250 | 0.055 | [illegible]
Region*Quarter-3 | 0.057 | 0.051 | 0.2686
Region*Quarter-4 | 0.000 | |
Region 9
Effect | Estimate | StdErr | P-Value
Region | -2.125 | 0.318 | [illegible]
Time*Region | 0.118 | 0.066 | 0.0749
Region*Quarter-1 | -0.016 | 0.021 | 0.4596
Region*Quarter-2 | -0.214 | 0.020 | [illegible]
Region*Quarter-3 | -0.006 | 0.019 | 0.7635
Region*Quarter-4 | 0.000 | |
Region 10
Effect | Estimate | StdErr | P-Value
Region | -2.916 | 0.311 | [illegible]
Time*Region | 0.052 | 0.065 | 0.4231
Region*Quarter-1 | -0.325 | 0.052 | [illegible]
Region*Quarter-2 | -0.196 | 0.050 | [illegible]
Region*Quarter-3 | 0.113 | 0.049 | [illegible]
Region*Quarter-4 | 0.000 | |
Variance Components
Effect | Estimate | StdErr | P-Value
UN(1,1) | 0.210 | 0.007 | [illegible]
UN(2,1) | -0.020 | 0.001 | [illegible]
UN(2,2) | 0.009 | 0.000 | [illegible]
Other Predictors
Effect | X Est. | X StdErr | X P-Val | X*Time Est. | X*Time StdErr | X*Time P-Val | X*Time2 Est. | X*Time2 StdErr | X*Time2 P-Val
Screening Penetration | -1.010 | 0.066 | <0.0001 | 0.117 | 0.014 | <0.0001 | -0.087 | 0.004 | <0.0001
Pct Units Built Before 1960 | 1.792 | 0.109 | <0.0001 | -0.209 | 0.023 | <0.0001 | 0.026 | 0.001 | <0.0001
TRI Lead Total Air > 95th Percentile | 0.250 | 0.051 | <0.0001 | -0.013 | 0.010 | 0.1944 | 0.000 | 0.000 | <0.0001
Median Rent | -0.039 | 0.019 | [illegible] | -0.013 | 0.004 | 0.0009 | -0.002 | 0.000 | <0.0001
Total Cumulative Funding: 36-Month Lag | -0.002 | 0.000 | <0.0001 | 0.001 | 0.000 | <0.0001 | -0.0002 | <0.0001 | <0.0001
Pct of Residents Without College Degree | 0.544 | 0.254 | [illegible] | 0.077 | 0.053 | 0.1431 | -0.014 | 0.002 | <0.0001
Pct Units with No Household Wage | 0.270 | 0.300 | 0.3687 | -0.120 | 0.064 | 0.0592 | 0.005 | 0.004 | 0.2343
Pct < 6 Yrs of Age | -1.569 | 1.094 | 0.1515 | -0.449 | 0.237 | 0.0585 | 0.061 | 0.015 | 0.0001
Pct Single Parent | 0.139 | 0.243 | 0.5662 | -0.101 | 0.052 | 0.0511 | -0.002 | 0.002 | 0.3862
Pct Black Alone | 0.697 | 0.125 | <0.0001 | 0.072 | 0.027 | 0.0069 | 0.003 | 0.001 | 0.0409
Pct Vacant Units | 0.102 | 0.161 | 0.5258 | 0.004 | 0.035 | 0.9063 | -0.012 | 0.003 | <0.0001
Mean Water Lead Conc. (water=1) | 0.056 | 0.011 | <0.0001 | 0.002 | 0.004 | 0.6776 | -0.003 | 0.001 | <0.0001
Mean Water Lead Conc. (water=2) | -0.151 | 0.027 | <0.0001 | -0.013 | 0.010 | 0.1800 | 0.013 | 0.002 | <0.0001
Median Air > 99th Percentile | 0.307 | 0.100 | 0.0022 | 0.012 | 0.010 | 0.2031 | -0.001 | 0.000 | 0.0007
Total Housing Units | -0.001 | 0.001 | 0.6118 | | | | | |
51
-------
[Figure: histogram of residuals; x-axis: Residuals from Model-3 (P5)]
Figure 5-1. Histograms of Residuals from Fitted National Multivariate Model 1
[Figure: predicted versus observed values; x-axis: Observed (P5); R2 from fitted regression = 0.857]
Figure 5-2a. Plot of National Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children
with BLL > 5 ug/dL
[Figure: predicted versus observed values; x-axis: Observed (P5 - Logit Scale); R2 from fitted regression = 0.840]
Figure 5-2b. Plot of National Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children
with BLL > 5 ug/dL (Logit Scale)
52
-------
Table 5-3. Model 2 (Proportion > 10 ug/dL) Parameter Estimates for Multivariate National Model
Region I
Effect
Region
Time'Region
Region *Quaitef-l
Region*Quarter-2
Region *Quartcr-3
Region 'Quaiter-4
Estimate
Ji973
0 I 17
-0 ] 95
-0012
0342
0000
SldErr
0 277
0 05 i
OOIS
OOI5
00 14
P-Vnlue
'.' ''i
0 3983
,.,,.,
Region 4
Effect
Region
Tinie'Region
Region*Qiiaiter-l
iion°Quaiter-3
Reglon"Quaiter-4
Estimate
-5 505
0 185
-0 354
-0 145
0.304
0000
SldErr
0286
0053
0.012
001 1
0010
P-Value
Region 3
Estimate
Region 5
Effect
Reaion
Tmre'Rcgion
Region*Qiiaiter- 1
Region'Quartei-2
Region'Quaita-3
Regmn'Quaitcr-4
Estimate
-5417
0 046
-0467
-0.13S
0 370
0000
StdErr
0259
0048
0 007
0007
0 006
P-Value
O 344^
Region 6
Effect
Region
I ime*Region
Reg!on"Quarler-l
RegionJQuisilei-2
RegionTQuaiier-3
RegionTQLi;siEer^l
Estimjite
-5 175
0098
-0 166
-0.0-18
0090
0.000
StdErr
0 264
0 050
OOI9
OOIS
OOIS
P-Value
Region 8
Effect
Region
1 ime*Region
Region*Quarter-l
Region*Quarter-2
Region'Quarler-3
Region *Quancr4
Estimate
-5 624
0042
-0 482
-0 164
-0 06 1
0000
StdErr
0 286
0 056
0 136
0 120
0113
P-Value
0 4465
0 1734
0 5930
Region 9
Effect
Region
1 ime'Reglcit
Region*Quar!cr-i
Region*Quarter-2
Region'Quiirtei-3
Region*Qusrter-4
Estimate
-4.747
0.249
0 088
-0 108
0050
0000
StdErr
0310
0 059
0 03 5
0031
0032
P-Value
-
: i
0 1223
Effect
Wage
s ot Age
Other Predictors
Kst.
1 049
StdErr
P-Val
=0.000 1
X* Time
Est.
0.019
-0071
-0.086
StdErr
0.010
I'-Val
0001
0.3791
X* Ttme2
Est.
StdErr
0.001
<0 0001
53
-------
[Figure: histogram of residuals; x-axis: Residuals from Model-4 (P10)]
Figure 5-3. Histograms of Residuals from Fitted National Multivariate Model 2
[Figure: predicted versus observed values; x-axis: Observed (P10); R2 from fitted regression = 0.896]
Figure 5-4a. Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with BLL > 10 ug/dL.
[Figure: predicted versus observed values; x-axis: Observed (P10 - Logit Scale); R2 from fitted regression = 0.867]
Figure 5-4b. Plot of National Multivariate Model Predicted Values versus Observed with Fitted
Regression Line and 45° Reference Line for Proportion of Children with BLL > 10 ug/dL
(Logit Scale)
54
-------
Table 5-4. Model 3 (Proportion > 15 ug/dL) Parameter Estimates for Multivariate National Model
Region 1
Effect
Region
Time*Region
Region'Quarter-!
Region'Quarter-2
Region "Quart er-3
Region "Quarterns
Estimate
-7800
0-186
-0.197
0.02S
04! 1
0.000
SldErr
0.339
0.066
0.027
0.025
0023
P-Valllt
' UUIJI
ftiit'MMS
- UuO 1
0265927
- OOlil
Region 4
Effect
Region
Time*Region
Region*Quarter-l
Region*Quaner-2
Region*Quaner-3
Region *QuarteM
Eitlmite
-7743
0.144
-0.262
0.089
0 169
0.000
SldErr
0316
0062
0.029
0.027
0026
P-Value
• uuul
u u:u-Jv
• liddl
U UUOS^
Ui'UI
Region 7
Effect
Region
Time'Region
Region"Quar[er-l
RegionKQuailer-2
Region*Quarter-3
Region*Qyarter^4
Estimate
-7750
O.I 39
-0.492
0.079
0303
0.000
StdErr
03I7
0063
0.027
0,024
0,023
P-Value
• t'iml
'..' U2<>t*si
i,t,0l
(> Uulu4i
IJUli !
Region 10
Effect
Region
Time*Region
Region*Quarter-i
Region*Quadcr-2
Regio5i*Quarler-3
Region*Qt!ancrjl
Eilimale
-8.170
0.172
-0892
-0.350
0.019
0.000
SldErr
0393
0079
0 186
0 156
0 147
P-Valne
Uuril
mess.!:
Will
tit.OM'IS
0895229
Variance Components
Effect
UN(I,1)
UN(2.I)
UN(2,2)
Eitlmite
0.365
-0.018
0009
SldErr
0016
0002
0001
P-Value
Region 2
Effect
Region
Time*Region
Region*Quarter- 1
Region*Quarter-2
Region*Quarter-3
Xegion*Quarter-4
Knirnate
-8243
0.262
-0.352
-0.073
0373
0000
SldErr
0.346
0068
0.021
0019
CO) 8
P-Value
' IHII
d Him] : •
tifMj
DM I U 1
' I 1 1
Region 3
Effect
Region
Time*Region
Region*Quarter- 1
Reg3Qn*Qtianef-2
Region*Quarler-3
Region*Quarter-4
Estimate
-7620
0.097
-0.361
-0.093
0.279
0.000
StdErr
0.325
0.064
0.021
0020
0018
P-V»lue
- own
0 13021
••' ODol
OlK.ll
' OOf'l
Region 5
Effect
Region
Time*Region
Reg:on*Quarter-l
Regio!i*Quaner-2
Region "Quarter- 3
Region*QuaneM
Estimate
-7958
0.118
-0.512
-0.106
0.440
0.000
SldErr
0.315
0062
0011
0.010
0.009
P-Value
umil
0057593
• Will
Uf!H
OHUJ
Region 6
Effect
Region
Time*Region
Region*Quaner-l
Region*Quarler-2
Region*Qua!1er-3
Region*Quarter^4
Estimate
-7 702
0.186
-O.I 46
-0089
0.032
0.000
SldErr
0 I
OOC!
oo
oo
00 I
P-Value
I
I
0 (4T 6
Region »
Effecl
Region
Time*Region
Region*Quaner-i
Region*Quaner-2
Region *Quarter-3
Region*Quaner-4
Etliraale
-8 1 92
0. 12I
-0730
-0 1 04
-0.093
0000
SldErr
0363
0.075
0.24 1
0 193
0.185
P-Valne
. IB III |
0 104597
nc.,..a.i,,i
0589189
0.613293
Region 9
Effect
Region
Time*Region
Re.gion*Quarter-J
RegionTQuaner-2
Region*Quarter-3
Region*Quaner^
Estimate
-6.830
0 149
-0.171
-0.220
0016
0.000
StdErr
0.382
0077
0.049
0.046
0045
P-Value
l'IA.r|
0.053885
f.i ouu J*-l
UUUI
0.722024
Other Predictors
Effect
Screening Penetration
Median Rent
Pet. Occupied Units Built
Before 1950
Pet of Residents No College
Pet Units with No Household
Wage
Current CDC funding: 12-
Month Time Lag
Pet <6 Yrsof Age
Pet Asian AJone
FRI Lead Water Surface > 95th
Percentiie
Mean Water Lead Cone. (water
= 1)
Mean Water Lead Cone, (water
=2)
Housing Density
Median Air > 99th Percentiie
Pet, Rented Units
Pet Single Parent
Est.
-4,185
0.178
2.993
0,799
1.400
0.026
3.599
-2.253
0,223
-0.025
0.110
-0.006
0.665
0.949
2.135
X
StdErr
OJ87
0.027
0.185
0.260
0,454
0.025
1.718
0.953
0.066
0.032
0.075
0.002
0.143
0.325
0313
P-Val
<0.0001
-------
[Histogram: x-axis Residuals from Model-5 (P15), from -0.148 to 0.332]
Figure 5-5. Histograms of Residuals from Fitted National Multivariate Model 3
[Plot: x-axis Observed (P15); R2 from Fitted Regression = 0.867]
Figure 5-6a. Plot of National Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children
with BLL > 15 ug/dL.
[Plot: x-axis Observed (P15 - Logit Scale); R2 from Fitted Regression = 0.821]
Figure 5-6b. Plot of National Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children
with BLL > 15 ug/dL (Logit Scale)
56
-------
Table 5-5. Model 4 (Proportion > 25 ug/dL) Parameter Estimates for Multivariate National Model

Region 1
Effect              Estimate   StdErr   P-Value
Region                -9.307    0.403   <.0001
Time*Region            0.250    0.074   [illegible]
Region*Quarter-1      -0.270    0.055   <.0001
Region*Quarter-2       0.111    0.050   [illegible]
Region*Quarter-3       0.441    0.046   <.0001
Region*Quarter-4       0.000

Region 2
Effect              Estimate   StdErr   P-Value
Region                -9.797    0.421   <.0001
Time*Region            0.207    0.081   [illegible]
Region*Quarter-1      -0.287    0.043   <.0001
Region*Quarter-2       0.003    0.039   0.930399
Region*Quarter-3       0.466    0.036   <.0001
Region*Quarter-4       0.000

Region 3
Effect              Estimate   StdErr   P-Value
Region                -9.083    0.393   <.0001
Time*Region            0.090    0.073   0.218218
Region*Quarter-1      -0.452    0.045   <.0001
Region*Quarter-2      -0.048    0.042   0.248085
Region*Quarter-3       0.319    0.037   <.0001
Region*Quarter-4       0.000

Region 4
Effect              Estimate   StdErr   P-Value
Region                -9.481    0.380   <.0001
Time*Region            0.156    0.070   0.025874
Region*Quarter-1      -0.122    0.063   0.051773
Region*Quarter-2       0.204    0.058   [illegible]
Region*Quarter-3       0.313    0.056   <.0001
Region*Quarter-4       0.000

Region 5
Effect              Estimate   StdErr   P-Value
Region                -9.426    0.379   <.0001
Time*Region            0.131    0.071   0.064377
Region*Quarter-1      -0.541    0.023   <.0001
Region*Quarter-2      -0.008    0.020   0.705084
Region*Quarter-3       0.537    0.018   <.0001
Region*Quarter-4       0.000

Region 6
Effect              Estimate   StdErr   P-Value
Region                -9.396    0.386   <.0001
Time*Region            0.165    0.073   [illegible]
Region*Quarter-1      -0.117    0.071   0.100954
Region*Quarter-2       0.031    0.070   0.658998
Region*Quarter-3       0.136    0.067   [illegible]
Region*Quarter-4       0.000

Region 7
Effect              Estimate   StdErr   P-Value
Region                -9.474    0.380   <.0001
Time*Region            0.172    0.072   [illegible]
Region*Quarter-1      -0.409    0.059   <.0001
Region*Quarter-2       0.299    0.050   <.0001
Region*Quarter-3       0.500    0.048   <.0001
Region*Quarter-4       0.000

Region 8
Effect              Estimate   StdErr   P-Value
Region                -9.634    0.493   <.0001
Time*Region            0.170    0.103   0.098117
Region*Quarter-1      -0.984    0.531   0.064023
Region*Quarter-2       0.087    0.373   0.814912
Region*Quarter-3      -0.349    0.395   0.376556
Region*Quarter-4       0.000

Region 9
Effect              Estimate   StdErr   P-Value
Region                -8.770    0.463   <.0001
Time*Region            0.157    0.088   0.074789
Region*Quarter-1      -0.115    0.098   0.241363
Region*Quarter-2      -0.036    0.092   0.691996
Region*Quarter-3       0.025    0.092   0.78802
Region*Quarter-4       0.000

Region 10
Effect              Estimate   StdErr   P-Value
Region                -9.959    0.540   <.0001
Time*Region            0.206    0.098   [missing]
Region*Quarter-1      -0.225    0.388   [missing]
Region*Quarter-2       0.180    0.345   [missing]
Region*Quarter-3       0.194    0.350   [missing]
Region*Quarter-4       0.000

Other Predictors (X)
Effect                                          Est.   StdErr   P-Val
[label missing]                               -4.491    0.275   <0.0001
[label missing]                                3.344    0.268   <0.0001
[label missing]                                0.183    0.033   <0.0001
[label missing]                                0.351    0.324   0.2787
[label missing]                                2.475    0.619   [illegible]
[label missing]                               -1.135    1.082   0.2943
[label missing]                               -0.015    0.005   [illegible]
[label missing]                                6.920    2.254   [illegible]
[label partly missing] > 95th Percentile      -0.089    0.183   0.6295
Mean Water Lead Conc. (water = 1)              0.020    0.021   0.3446
Mean Water Lead Conc. (water = 2)              0.030    0.053   0.5689
Housing Density                               -0.006    0.002   0.0008
Median Air > 99th Percentile                   0.383    0.183   0.0365
Pct. Single Parent                             2.060    0.428   <0.0001
Pct. Rented Units                              1.102    0.408   0.0069

X*Time terms (row labels missing; values in source order)
  Est.   StdErr   P-Val
-0.478    0.073   <0.0001
-0.005    0.057   0.9231
-0.013    0.006   0.0178
-0.111    0.131   0.3951
-0.006    0.002   0.0004
-0.928    0.462   0.0448
-0.021    0.040   0.5914
-0.355    0.093   0.0001
 0.037    0.078   0.6330

X*Time2 terms (row labels missing; values in source order)
  Est.   StdErr   P-Val
-0.003    0.001   [illegible]
 0.002    0.000   [illegible]
 0.173    0.028   <0.0001
57
-------
[Histogram: x-axis Residuals from Model-6 (P25), from -0.0525 to 0.1975]
Figure 5-7. Histograms of Residuals from Fitted National Multivariate Model 4
[Plot: x-axis Observed (P25); R2 from Fitted Regression = 0.744]
Figure 5-8a. Plot of National Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children
with BLL > 25 ug/dL.
[Plot: x-axis Observed (P25 - Logit Scale); R2 from Fitted Regression = 0.726]
Figure 5-8b. Plot of National Multivariate Model Predicted Values versus Observed with
Fitted Regression Line and 45° Reference Line for Proportion of Children
with BLL > 25 ug/dL (Logit Scale)
58
-------
5.2 High-Resolution Modeling Results
The Massachusetts final multivariate models were constructed similarly to the national models.
One basic difference in the Massachusetts models is that there was no EPA region. Thus, there
was no area variable included other than census tract. As in the national model, time period and
quarter were significant predictors; however, in the Massachusetts model they are not interacted
with an area variable. Table 5-6 presents the full set of variables included in the final
multivariate models for Models 1 through 5. Model 6 was not fit for the Massachusetts data
because of the scarcity of data above 25 ug/dL.
Among the demographic variables, housing cost, occupancy, family structure, and housing age
were significant predictors in all five models. Median Rent was the selected housing cost
variable in all five models. For occupancy, Percent Rented was the selected variable in four of
the five models. Percent of Single Parent Households is the family structure variable in all
models. Three housing age variables were included across the five models, but Percent Built
Pre-1950 was the included variable in Models 1 to 3.
Race and Income variables were included in four of the five final models. Median Household
Income and Percent Multiple Races were the two variables used in all four models. Children's
Age, Education, and Population each had a variable included in one of the final models. Number
of Children less than or equal to six years old and Total Population were included in Model 3.
Percent Without 9th Grade Education was included in Model 4.
Unlike the national models, none of the environmental variables were included in the final
multivariate models for Massachusetts. On the other hand, the housing inspection data from
Massachusetts were predictive and included in all of the final models. The percentage of units
passing the Massachusetts standard of care (calculated using the MDPH method) was included in
all five models. Additionally, the percentage of units failing the Massachusetts standard of care
(calculated using the MDPH method) was included in Models 4 and 5.
A programmatic funding variable was included in Models 1, 2, and 5. Current State
Funding ($ per Child) was used in the two GM models (Models 1 and 2), and Cumulative CDC
Funding ($ per tract) was used in Model 5.
As with the national models, parameter estimates and associated standard errors and p-values are
presented for all models in Table 5-7. Figures 5-9 to 5-18 contain the histograms of residuals
and plots of observed versus predicted values that allow assessment of the various model fits.
These plots suggest that Models 1-3 are performing well, with Models 4 and 5 providing a
somewhat suboptimal fit (perhaps due to fewer children being observed above the 10 and 15
ug/dL threshold values in Massachusetts). The weighted regression line fit to the observed
versus predicted plots (shown in blue) also demonstrates a systematic degradation in model
performance from Models 3 through 5, with the R2 value falling from 0.579 to 0.312 to 0.119
(untransformed scale) as the blood-lead threshold value increases. Similar to the National
Models, the High-Resolution Multivariate Models in Massachusetts tend to under-predict for
census-tract/quarter combinations with higher geometric mean blood-lead concentrations and
higher exceedance proportions. Further exploration may be necessary to determine whether these
higher values represent census-tract/quarter combinations with fairly sparse data (i.e., few
observations), which might explain why they would have been less influential in Models 2
through 5, which are influenced by the number of observations associated with each observed
value.
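The fitted-regression R2 shown on the observed versus predicted plots can be reproduced along the following lines. This is a minimal sketch, assuming the line is an ordinary weighted least-squares fit of predicted on observed values with the number of observations as weights; the function name and the example inputs are hypothetical and are not taken from the study data.

```python
def weighted_r2(observed, predicted, weights):
    """Fit predicted = intercept + slope * observed by weighted least squares.

    Returns (slope, intercept, r2). Weights are assumed to be the number of
    children tested in each area/quarter combination (an assumption).
    """
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, observed)) / total
    y_bar = sum(w * y for w, y in zip(weights, predicted)) / total
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, observed))
    syy = sum(w * (y - y_bar) ** 2 for w, y in zip(weights, predicted))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, observed, predicted))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r2 = (sxy * sxy) / (sxx * syy)  # squared weighted correlation
    return slope, intercept, r2
```

A slope below 1 together with a high R2 would be consistent with the under-prediction at higher observed values noted above, because the fitted line then falls below the 45° reference line at the upper end of the range.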
Appendix F presents predictions of areas of the country estimated to have the highest children's
blood-lead levels. These predictions were generated by averaging predicted values across the
four quarters of 2006. Table F-1 lists the 150 counties/townships in the United States with the
highest predicted GM blood-lead levels (using Model 2) and proportion of children above 5, 10,
15, and 25 ug/dL. Table F-2 lists the 10 counties in each state with the highest levels of those
same five outcomes. Table F-3 lists the 150 Massachusetts census tracts with the highest
predicted GM blood-lead levels. Figure F-1 provides a map of these 150 Massachusetts census
tracts.
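The ranking step described above (average the quarterly predictions, then keep the highest areas) can be sketched as follows. The function name, the tuple layout, and the example areas and values are all hypothetical; the sketch only illustrates the averaging and sorting, not the study's actual processing code.

```python
from collections import defaultdict


def rank_by_annual_mean(predictions, top_n):
    """Average predicted values per area over its quarters and rank areas.

    predictions: iterable of (area_id, quarter, predicted_value) tuples.
    Returns the top_n (area_id, mean_value) pairs, highest mean first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for area, _quarter, value in predictions:
        totals[area] += value
        counts[area] += 1
    means = {area: totals[area] / counts[area] for area in totals}
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up predicted GM values for three areas over four quarters.
example = [
    ("Area A", 1, 2.0), ("Area A", 2, 2.5), ("Area A", 3, 3.0), ("Area A", 4, 2.5),
    ("Area B", 1, 3.0), ("Area B", 2, 3.5), ("Area B", 3, 4.0), ("Area B", 4, 3.5),
    ("Area C", 1, 1.0), ("Area C", 2, 1.0), ("Area C", 3, 1.0), ("Area C", 4, 1.0),
]
top_two = rank_by_annual_mean(example, 2)
```

Calling the function with `top_n=150` on county-level predictions for the four quarters of one year would give a list shaped like the one described for Table F-1.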
60
-------
Table 5-6. Summary of Variables Included in Final Massachusetts Multivariate Model

Variable Type        Variable                                          Models
Time                 Time Period, Quarter                              1, 2, 3, 4, 5
Income               Median Household Income                           1, 2, 3, 4
Race                 Percent Multiple Race                             1, 2, 3, 4
Housing Cost         Median Rent                                       1, 2, 3, 4, 5
Occupancy            Percent Rented                                    1, 2, 3, 4
                     Percent Vacant                                    5
Family Structure     Percent Single Parent                             1, 2, 3, 4, 5
Housing Age          Percent Built Pre-1950                            1, 2, 3
                     Percent Occupied Built Pre-1940                   4
                     Percent Occupied Built Pre-1980                   5
Children's Age       Number less than 6 years old                      3
Education            Percent without 9th Grade education               4
Population           Total Population                                  3
Housing Inspection   P4 - % Passing Standard of Care, MDPH Method      1, 2, 3, 4, 5
                     F4 - % Failing Standard of Care, MDPH Method      4, 5
Funding              Current State Funding ($ per Child)               1, 2
                     Cumulative CDC Funding ($ per tract)              5
61
-------
Table 5-7. Massachusetts Multivariate Model Estimates

Model 1 (Geometric Mean)
Effect                                         Level   Estimate   Std. Error   P-Value
Intercept                                                 2.290        0.033   <.0001
Time                                                     -0.088        0.002   <.0001
Quarter (Season)                               1         -0.187        0.007   <.0001
Quarter (Season)                               2         -0.143        0.007   <.0001
Quarter (Season)                               3          0.130        0.006   <.0001
Quarter (Season)                               4          0.000
Median Household Income                                  -0.008        0.001   <.0001
Percent Multiple Races                                    3.909        0.536   <.0001
Median Rent ($)                                          -0.032        0.005   <.0001
Percent Rented Units                                     -0.568        0.069   <.0001
Percent Single Parent Households                          0.702        0.096   <.0001
Percent Units Built Before 1950                           0.849        0.047   <.0001
p4                                                       -0.983        0.166   <.0001
Current State Funding                                     0.028        0.006   <.0001
Variance components (symbols illegible in source): 0.229, -0.026, 0.004; Error: 0.191

Model 2 (Weighted Geometric Mean)
Effect                                         Level   Estimate   Std. Error   P-Value
Intercept                                                 2.249        0.033   <.0001
Time                                                     -0.087        0.002   <.0001
Quarter (Season)                               1         -0.185        0.006   <.0001
Quarter (Season)                               2         -0.137        0.006   <.0001
Quarter (Season)                               3          0.127        0.006   <.0001
Quarter (Season)                               4          0.000
Median Household Income                                  -0.008        0.001   <.0001
Percent Multiple Races                                    3.826        0.531   <.0001
Median Rent ($)                                          -0.032        0.005   <.0001
Percent Rented Units                                     -0.561        0.069   <.0001
Percent Single Parent Households                          0.735        0.096   <.0001
Percent Units Built Before 1950                           0.866        0.046   <.0001
p4                                                       -0.933        0.164   <.0001
Current State Funding                                     0.031        0.006   <.0001
Variance components (symbols illegible in source): 0.218, -0.024, 0.004; Error: 3.986

Model 3 (Proportion of Children with Blood Lead > 5 ug/dL)
Effect                                         Level   Estimate   Std. Error   P-Value
Intercept                                                -2.312        0.054   <.0001
Time                                                     -0.146        0.003   <.0001
Quarter (Season)                               1         -0.195        0.010   <.0001
Quarter (Season)                               2         -0.117        0.010   <.0001
Quarter (Season)                               3          0.187        0.009   <.0001
Quarter (Season)                               4          0.000
Median Household Income                                  -0.010        0.001   <.0001
Percent Multiple Races                                    3.420        0.610   <.0001
Median Rent ($)                                          -0.044        0.006   <.0001
Percent Rented Units                                     -0.724        0.083   <.0001
Percent Single Parent Households                          0.817        0.119   <.0001
Percent Units Built Before 1950                           1.468        0.056   <.0001
Number Residents Less than Six Years of Age               0.000        0.000   0.0114
Total Population                                          0.000        0.000   0.0292
P4                                                       -0.649        0.218   0.0029
Variance components (symbols illegible in source; each pair is the value printed in the
Levels column followed by the value printed in the Estimate column): 0.129 / 0.007,
-0.008 / 0.001, 0.004 / 0.000

Model 4 (Proportion of Children with Blood Lead > 10 ug/dL)
Effect                                         Level   Estimate   Std. Error   P-Value
Intercept                                                -4.235        0.052   <.0001
Time                                                     -0.136        0.005   <.0001
Quarter (Season)                               1         -0.282        0.022   <.0001
Quarter (Season)                               2         -0.130        0.021   <.0001
Quarter (Season)                               3          0.247        0.020   <.0001
Quarter (Season)                               4          0.000
Median Household Income                                  -0.010        0.001   <.0001
Percent Multiple Races                                    4.326        0.781   <.0001
Median Rent ($)                                          -0.057        0.009   <.0001
Percent Rented Units                                     -0.562        0.119   <.0001
Percent Single Parent Households                          0.697        0.159   <.0001
Percent Occupied Units Built Before 1980                  1.758        0.083   <.0001
Percent Residents with Less than Ninth Grade
  Education                                              -0.581        0.279   0.0372
f4                                                        1.802        0.504   0.0004
p4                                                       -1.339        0.321   <.0001
Variance components (symbols illegible in source; each pair is the value printed in the
Levels column followed by the value printed in the Estimate column): 0.175 / 0.016,
-0.013 / 0.003, 0.004 / 0.001

Model 5 (Proportion of Children with Blood Lead > 15 ug/dL)
Effect                                         Level   Estimate   Std. Error   P-Value
Intercept                                                -6.278        0.144   <.0001
Time                                                     -0.093        0.008   <.0001
Quarter (Season)                               1         -0.307        0.042   <.0001
Quarter (Season)                               2         -0.093        0.040   0.019
Quarter (Season)                               3          0.332        0.036   <.0001
Quarter (Season)                               4          0.000
Median Rent ($)                                          -0.049        0.010   <.0001
Percent Vacant Units                                      1.039        0.335   0.0019
Percent Single Parent Households                          1.002        0.169   <.0001
Percent Occupied Units Built Before 1980                  1.187        0.168   <.0001
f4                                                        4.047        0.677   <.0001
p4                                                       -1.476        0.440   0.0008
Cumulative CDC Funding                                    0.000        0.000   0.0127
Variance components (symbols illegible in source; each pair is the value printed in the
Levels column followed by the value printed in the Estimate column): 0.249 / 0.035,
-0.020 / 0.008, 0.004 / 0.002
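As a reading aid for the p-values in Table 5-7, the short sketch below computes a two-sided p-value from an estimate and its standard error using a normal (Wald) approximation. This is an assumption made for illustration: the report does not state the reference distribution used, and a t distribution with many degrees of freedom would give nearly the same result. The function name is hypothetical.

```python
import math


def wald_p_value(estimate, std_err):
    """Two-sided p-value for estimate / std_err under a standard normal."""
    z = estimate / std_err
    return math.erfc(abs(z) / math.sqrt(2.0))


# Inputs taken from Table 5-7: p4 in Model 3 and Percent Vacant Units in Model 5.
p_model3_p4 = wald_p_value(-0.649, 0.218)
p_model5_vacant = wald_p_value(1.039, 0.335)
```

Under this approximation the two values come out close to the tabulated 0.0029 and 0.0019; small differences are expected because the tabulated estimates and standard errors are rounded to three decimals.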
64
-------
[Histogram: x-axis Residuals from Model-1 (GM), from -1.875 to 4.725]
Figure 5-9. Histograms of Residuals from Fitted Massachusetts Multivariate Model 1
[Plot: x-axis Observed (GM); R2 from Fitted Regression = 0.697]
Figure 5-10. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Unweighted Geometric Mean Response
65
-------
[Histogram: x-axis Residuals from Model-2 (Weighted GM), from -1.8 to 4.95]
Figure 5-11. Histograms of Residuals from Fitted Massachusetts Multivariate Model 2
[Plot: x-axis Observed (Weighted GM); R2 from Fitted Regression = 0.700]
Figure 5-12. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for Weighted
Geometric Mean Response
66
-------
[Histogram: x-axis Residuals from Model-3 (P5), from -0.34 to 0.62]
Figure 5-13. Histograms of Residuals from Fitted Massachusetts Multivariate Model 3
[Plot: x-axis Observed (P5); R2 from Fitted Regression = 0.579]
Figure 5-14a. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Proportion of Children with BLL > 5 ug/dL.
[Plot: x-axis Observed (P5 - Logit Scale); R2 from Fitted Regression = 0.531]
Figure 5-14b. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Proportion of Children with BLL > 5 ug/dL (Logit Scale)
67
-------
[Histogram: x-axis Residuals from Model-4 (P10), from -0.12 to 0.36]
Figure 5-15. Histograms of Residuals from Fitted Massachusetts Multivariate Model 4
[Plot: x-axis Observed (P10); R2 from Fitted Regression = 0.312]
Figure 5-16a. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Proportion of Children with BLL > 10 ug/dL
[Plot: x-axis Observed (P10 - Logit Scale); R2 from Fitted Regression = 0.269]
Figure 5-16b. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Proportion of Children with BLL > 10 ug/dL (Logit Scale)
68
-------
[Histogram: x-axis Residuals from Model-5 (P15), from -0.039 to 0.273]
Figure 5-17. Histograms of Residuals from Fitted Massachusetts Multivariate Model 5
[Plot: x-axis Observed (P15); R2 from Fitted Regression = 0.119]
Figure 5-18a. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Proportion of Children with BLL > 15 ug/dL.
[Plot: x-axis Observed (P15 - Logit Scale); R2 from Fitted Regression = 0.113]
Figure 5-18b. Plot of Massachusetts Multivariate Model Predicted Values versus
Observed with Fitted Regression Line and 45° Reference Line for
Proportion of Children with BLL > 15 ug/dL (Logit Scale)
69
-------
6.0 GRAPHICAL PRESENTATION OF MODELING RESULTS
In addition to the discussion of the final multivariate modeling results in Section 5, it is
informative to view the results visually. Two methods were used for graphical
presentation of the results: mapping and an interactive software tool. Section 6.1
presents a subset of the maps generated, while Sections 6.2 and 6.3 discuss the interactive
software tool.
6.1 Maps of Observed and Predicted Blood-Lead Outcomes
Mapping is an informative method to graphically present the results of the multivariate models.
Figure 6-1 contains maps displaying the observed GM blood-lead levels in 2000 and
2005 based on CDC's national surveillance data, and the comparable predicted GM blood-lead
levels in 2000 and 2005. A key difference is that maps of observed levels contain many counties
with missing data, either because they do not submit childhood lead surveillance data to CDC or
because they have too few test records to be included in the analysis, while the maps of predicted
levels cover all counties in the country. Appendix G contains detailed maps from the national level
models of GM blood-lead levels and proportion of children with BLLs > 10 ug/dL.
Because it is difficult to view many of the individual counties within the U.S.-level maps,
regional-level maps also were produced. Figure 6-2 contains examples of these for EPA Region
V. Comparable maps for all regions are included in Appendix G. With darker colors
representing areas of higher lead levels, it appears that lead levels declined across EPA
Region V from 2000 to 2005. Figure 6-3 contains maps of the observed and
predicted proportions of children with blood-lead levels above 10 ug/dL in Massachusetts at the
census-tract level. The Boston area is enlarged to better show the tracts in that area.
6.2 Visualization Tool Development
In addition to generating maps, a software tool was developed to provide a flexible way for users
to quickly view data for particular areas and to obtain the information that led to the results being
viewed. To do this, the project team adapted existing technology, developed through internal
research, to meet the needs of this study. The software stitches
together a series of static maps so that they can be viewed dynamically. This allows users to
view a movie of changes in surfaces over time and space.
The software is written in C++. Users interact with the software via a Windows GUI that is
implemented using Microsoft Foundation Classes (MFC). The 3-dimensional graphics within
the tool were implemented using the Open Graphics Library (OpenGL).
70
-------
[Maps: four panels showing the Predicted and Observed Proportion of Blood-Lead Levels >/= 10 ug/dL
for the United States in 2000 and 2005. Legend classes for the proportion of PbB levels >/= 10 ug/dL:
0 - 0.0099, 0.01 - 0.0249, 0.025 - 0.0499, 0.05 - 0.0749, 0.075 - 0.0999, 0.10 - 0.2499, 0.25 +,
and No data.]
Figure 6-1. Observed and Predicted Proportion of Children with Blood-Lead Levels > 10 ug/dL in the United States
by County, 2000 and 2005
71
-------
[Maps: four panels showing the Predicted and Observed Proportion of Blood-Lead Levels >/= 10 ug/dL
for US EPA Region 5 in 2000 and 2005. Legend classes for the proportion of PbB levels >/= 10 ug/dL:
0 - 0.0099, 0.01 - 0.0249, 0.025 - 0.0499, 0.05 - 0.0749, 0.075 - 0.0999, 0.10 - 0.2499, 0.25 +,
and No data.]
Figure 6-2. Observed and Predicted Proportion of Children with Blood-Lead Levels > 10
ug/dL in Region V by County, 2000 and 2005
72
-------
[Maps: four panels showing the Observed and Predicted Proportion of Blood-Lead Levels >/= 10 ug/dL
for Massachusetts in 2000 and 2005. Legend classes for the proportion of PbB levels >/= 10 ug/dL:
0 - 0.0099, 0.01 - 0.0249, 0.025 - 0.0499, 0.05 - 0.0749, 0.075 - 0.0999, 0.10 - 0.2499, 0.25 +,
and No data.]
Figure 6-3. Observed and Predicted Proportion of Children with Blood-Lead Levels > 10 ug/dL in Massachusetts by Census
Tract, 2000 and 2005
73
-------
The software visualizes the observed values and the predicted values for the response variable of
each model (6 national models, 5 Massachusetts models). The software interpolates the
predicted values spatially within each state using a squared inverse distance algorithm; it
interpolates linearly in time. The predicted values are defined for each county. There are two
visualization modes: (1) a spatial surface moving in time, and (2) a time series. The tool was
built in a flexible way so that it can be easily adapted to accept updated data.
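The two interpolation steps described above can be illustrated with a short sketch. It is written in Python for brevity (the tool itself is written in C++), and it is not the tool's code: the function names, the use of planar x/y coordinates for county locations, and the handling of a query point that coincides with a data point are all assumptions made for illustration.

```python
def idw_squared(x, y, points):
    """Squared inverse-distance interpolation at location (x, y).

    points: list of (px, py, value) tuples, e.g. one per county (assumed
    here to be represented by a single planar coordinate pair).
    """
    numerator = 0.0
    denominator = 0.0
    for px, py, value in points:
        dist_sq = (x - px) ** 2 + (y - py) ** 2
        if dist_sq == 0.0:
            return value  # query point sits exactly on a data point
        weight = 1.0 / dist_sq  # weight proportional to 1 / distance^2
        numerator += weight * value
        denominator += weight
    return numerator / denominator


def interpolate_in_time(t, t0, v0, t1, v1):
    """Linear interpolation between value v0 at time t0 and v1 at time t1."""
    fraction = (t - t0) / (t1 - t0)
    return v0 + fraction * (v1 - v0)
```

For a movie-style display, one plausible arrangement is to evaluate `idw_squared` on a grid for each quarter and then call `interpolate_in_time` between consecutive quarterly surfaces for the frames in between.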
Figures 6-4 and 6-5 are screen shots from the visualization tool. Figure 6-4 provides an example
of a response surface generated by the tool to illustrate predicted blood-lead levels across a
geographic area. In this example, the area is the state of Illinois. Figure 6-5 provides an
example of how the visualization tool plots predicted blood-lead levels in a given
geographic area over time.
74
-------
Figure 6-4. Response Surface of Predicted Geometric Mean Blood-Lead Concentration
Across the State of Illinois from the Visualization Tool
75
-------
Figure 6-5. Time Series Plot of Observed and Predicted Geometric Mean Blood-Lead
Concentration in Cook County Illinois from the Visualization Tool
76
-------
7.0 DISCUSSION AND FUTURE WORK
The goal of this study was to determine whether tools could be developed to differentiate
geographic areas (counties and census tracts), based on their predicted risk of containing children
with elevated blood-lead levels. Statistical models were developed that link CDC's childhood
blood-lead surveillance data to demographic predictor variables available in the 2000 U.S.
Census. While earlier chapters of this report focus on the development and performance of these
statistical models, this chapter provides a discussion of the factors that should be considered
when using the models, and some preliminary ideas for improvement.
7.1 Major Findings
The results of this study suggest that longitudinal predictive models can be developed at the
county level across the nation based on the use of quarterly summary information from CDC's
National Surveillance Database, and at the census-tract level within states that have a long
history of universal screening and reporting, such as Massachusetts. These models can be used
to describe how risk of childhood lead poisoning changes over time within different regions of
the country, as well as within small geographic areas within states (e.g., counties) and even
smaller geographic areas within counties (e.g., census tracts). They can be used to predict the
risk of childhood lead poisoning in counties (or census tracts) with little or no surveillance data,
and also can be used to identify those counties (or census tracts) that are at highest risk at the end
of the period of observation (see Appendix F for a list of the 150 counties across the country at
highest risk predicted by each of the six models, as well as the top 10 counties within each state).
The statistical model chosen (a random-effects model with separate intercepts and slopes
estimated within each county or census tract) also allows ranking of geographic areas based on
the rate of decline over time after accounting for the fixed-effects variables of the model
(although only among those areas that provided adequate surveillance data). Within the context
of the Broad-Based National Model, these random effects would allow us to identify those
counties that are experiencing a more rapid reduction in risk of childhood lead poisoning over
time (to identify best practices) and those counties that are experiencing a significantly less rapid
decline over time (to identify areas in need of additional attention and resources for combating
lead poisoning), after already accounting for the demographic, programmatic, and environmental
factors included in the multivariate model.
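As a reading aid, the random-effects structure described above can be written out as follows. This is only a sketch in notation introduced here (none of the symbols appear in the report), assembled from the effects listed in the national parameter tables (Region, Time*Region, Region*Quarter, and the other predictors); it assumes the response is modeled on its fitted scale, for example the logit scale for the exceedance proportions.

```latex
\eta_{it} \;=\; \bigl(\alpha_{r(i)} + a_i\bigr)
        \;+\; \bigl(\beta_{r(i)} + b_i\bigr)\, t
        \;+\; \gamma_{r(i),\,q(t)}
        \;+\; \mathbf{x}_{it}^{\top}\boldsymbol{\delta},
\qquad
\begin{pmatrix} a_i \\ b_i \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},\;
\begin{pmatrix} \sigma_a^{2} & \sigma_{ab} \\ \sigma_{ab} & \sigma_b^{2} \end{pmatrix}
\right)
```

Here $i$ indexes the county (or census tract), $r(i)$ is its region, $q(t)$ is the quarter of time $t$, and $a_i$ and $b_i$ are the area-specific deviations in intercept and slope. Under this reading, ranking areas by their rate of decline amounts to ranking them by the estimated $b_i$, and the three variance components reported as UN(1,1), UN(2,1), and UN(2,2) would correspond to $\sigma_a^{2}$, $\sigma_{ab}$, and $\sigma_b^{2}$.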
Within the context of the series of Broad-Based National Models, the data suggest that there are
significant differences in the distribution of childhood blood-lead concentrations among the
different regions of the country, and that the manner in which these distributions change over
time and are impacted by seasonality also is regionally specific. After accounting for these
regional differences, a number of demographic, environmental, and programmatic variables were
found to be highly predictive of childhood blood-lead concentrations among the different
response variables modeled within this project. The specific variables that were found to be
predictive within the multivariate models varied based on the response variable; however,
several variables were selected in multiple models. In addition to
various census demographic variables that were identified in previous risk modeling efforts (e.g.,
age of housing, percent single parent families, race/ethnicity), it was found that variables
constructed from EPA's Safe Drinking Water Information System, time-lagged programmatic
funding information from HUD and/or CDC, and variables associated with high lead emissions
or predicted air concentrations were selected within the National (Low Resolution) multivariate
statistical models.
Within the context of the High-Resolution Model developed using data from the Commonwealth
of Massachusetts, the project team also found a highly significant downward trend in the risk of
childhood lead poisoning among the five models developed. Because very few
children were observed at or above 25 ug/dL within Massachusetts over the 2000-2006 period of
observation, the sixth model was not included. After accounting for the long-term reduction
over time and seasonality using methods similar to those employed in the Broad-Based
National Model, we found that only the demographic and programmatic variables were
predictive of the risk of childhood lead poisoning at the census-tract level. Of particular interest
were the variables that described the proportion of housing units within each census tract that
were found to be in compliance and out of compliance with the Massachusetts Standard of Care.
In all five of the multivariate models, the risk of childhood lead poisoning was significantly
reduced as the proportion of housing units in compliance increased within a census tract. In
addition, for the last two models (which predicted proportion of children at or above 10 and 15
ug/dL), the risk of childhood lead poisoning increased significantly as the proportion of housing
units out of compliance increased within a census tract.
7.2 Comparison Between Results and NHANES
Due to selection bias associated with surveillance data, it is expected that the CDC National
Surveillance dataset as well as the Massachusetts surveillance data may show higher proportions
of elevated blood-lead concentrations than found in the general population. For this reason, the
proportion of children with elevated blood-lead concentrations as well as the distribution of the
potential continuous summary measure derived from the surveillance data were compared with
those reported by the most recent six years of available CDC National Health and Nutrition
Examination Survey (NHANES). Results of this comparison are presented graphically in Figure
7-1, which suggests a highly significant difference between NHANES and CDC's
National Surveillance Database with respect to the proportion of children observed at or above 5
ug/dL (with lesser differences observed for the proportion of children observed at or above 10,
15, and 25 ug/dL). In future work on this project, EPA might consider calibrating
the surveillance data to better match the national distribution of childhood blood-lead
concentrations using methods similar to those employed by Strauss et al. (2001a).
[Figure: four panels plotting NHANES and National Surveillance values against time, 1994-2006.]
Figure 7-1. Comparison of National Surveillance Data to NHANES Data
7.3 Data Issues
The models that were developed as part of this project are based on data sources that have both
strengths and limitations. In this section, five potentially limiting aspects of the data are
considered: biases from the geocoding process, reporting limits in surveillance data, selection
bias inherent in the surveillance data, the limitations of ecological models for predicting
within-area relationships, and use of Census data from 2000.
7.3.1 Biases from Geocoding
The quarterly summary statistics from CDC's National Surveillance Database utilized in these
analyses were available at the county level of geographic specificity. CDC based this
summarization on county FIPS codes reported by its grantees. This field is quite well reported in
CDC's CBLS database. The Massachusetts surveillance data were summarized and analyzed at
the census-tract level, with the geocoding of address data within the Massachusetts data being
conducted by MDPH staff. While there is no reason to suspect lack of data quality within the
Massachusetts surveillance data, experience shows that the process of geocoding can introduce
some subtle biases into surveillance data. Thus, the following section is offered as a guide for
EPA to consider for future modeling efforts in which state or local surveillance data are
geocoded to the census-tract level:
The geocoding process is highly dependent on the quality of address data recorded by the local
lead poisoning prevention programs with whom the blood-lead information originated. Several
factors could prevent an address from being successfully geocoded, such as:
• Erroneous, illegible, or purposefully misleading address information being provided to
the childhood lead poisoning prevention program
• Address data that contain either a P.O. Box or Rural Route as part of the street address,
which typically cannot be successfully geocoded
• Errors in data entry.
While these problems with address data are likely to occur in all programs with a non-trivial
frequency, there may be a systematic bias that programs introduce (albeit unintentionally) when
correcting address data. It is likely that address data errors are identified and corrected with
higher frequency for children who have an elevated blood-lead level and require follow up.
Given the potential bias introduced through the geocoding process, further research may be
worthwhile to determine whether there are reasonable approaches that could be used to adjust the
models for this bias.
7.3.2 Reporting Limits in Surveillance Data
Other naturally occurring biases in the surveillance data may influence the degree to which
models are representative of the true trends in childhood lead poisoning. For example, within the
context of the Broad-Based National Model, there may be differences between states and
localities in the manner in which childhood blood-lead testing results are reported to CDC.
Sections 2 and 3 included a discussion about a screening algorithm that was applied to the
surveillance data to suppress county/quarter data combinations for areas that were not conducting
universal reporting of blood-lead testing results. Additional scrutiny of these data by CDC and
other members of the lead poisoning prevention community may reveal other county/quarter
combinations that were not identified through the screening algorithm that should be excluded
from these analyses. We are confident that the overall impact of including these data in the
current work will not severely bias the fixed-effects parameter estimates in the series of
generalized linear mixed models developed in this project.
7.3.3 Selection Bias in Surveillance Data
Selection bias is perhaps the most serious bias that remains unaccounted for in the models that
have been developed, and may have a severe impact on their predictive ability. Surveillance data
are observational by nature, and are not designed to be representative of the general population.
There are many competing forces that influence whether or not a child is screened at an
appropriate age, and recorded in the blood-lead surveillance database. Some have hypothesized
that surveillance data in the urban environment are representative of the affluent (who have
private health insurance) and the poor (who receive Medicaid or other medical assistance), while
under-representing the working poor (who may have no health insurance, and no mechanism for
receiving appropriate preventive medical testing). While this may be true in general, many
outstanding lead poisoning prevention programs currently are extending outreach, education, and
screening services to areas with historically high incidence of childhood lead poisoning. These
programs generally provide assistance to all members of the community, regardless of
entitlement status. While these services are typically offered in high-risk urban areas with the
infrastructure of a federally funded (CDC and/or HUD) or state-funded lead poisoning
prevention program, they typically are less available in similar high-risk rural areas without
similar infrastructure. In addition to outreach, education and screening activities, many
childhood lead poisoning prevention programs (or partnering housing agencies) receive funding
from HUD's Office of Lead Hazard Control to conduct environmental investigations and reduce
lead hazards in the residential environment. Many of these activities generate targeted screening
of children living in deteriorated, older housing - which also is a non-trivial source of selection
bias in the surveillance data.
An important question for EPA to address is how selection bias is likely to influence the relative
rankings of counties within a region or census tracts within a more localized area, as well as the
predictive ability of the models themselves.
7.3.4 Limitations of Ecological Models for Predicting Within-Area Relationships
The models that were developed within this project are ecological models that describe quarterly
distributional summary statistics within geographic areas as a function of predictor variables
assessed within those same geographic areas. It also may be the case that some of these
predictor variables have significant variation within a county (or census tract) - and that this
within-area variation is highly predictive of risk of childhood lead poisoning within these
geographic areas. Unfortunately, the data limitations within this study (for both the blood-lead
response variables as well as many of the predictor variables) prohibit us from ascertaining these
important person-level relationships. This type of relationship can be established only by linking
individual blood-lead concentration data with individual-level environmental, demographic,
and/or programmatic information (which usually is not available).
Within the context of the High-Resolution Massachusetts Model, it may be possible to link
individual blood-lead records with the longitudinal housing inspection information to assess the
loss of information associated with going from an individual-level model to an area-based
ecological model. This type of assessment could be introduced in later stages of this project.
7.3.5 Use of 2000 Census Data and Other Time Invariant Data as Predictors
One potential criticism of the modeling effort is that we are linking blood-lead surveillance data
collected between 1995 and 2006 to census data that were collected in 2000. Is the demographic
information collected in 2000 likely to remain unchanged over the course of time? The answer
probably depends on the variable under consideration. For example, age of housing in census
tracts or proportion of housing built prior to 1950 is not likely to change dramatically in census
tracts, unless there is a lot of demolition or new construction occurring. On the flip side, average
income is likely to change substantively over time.
Even though the demographic information contained in the 2000 Census is likely to change over
time, the more important question is what effect will that change have on our model predictions?
While the models likely would be improved with the use of more current census data for use as
predictors, we do not believe that the use of older (less current) information will result in poor or
inaccurate prediction. In fact, for the purpose of predicting current or future trends in childhood
lead poisoning, we are more concerned with the age of the surveillance data that are being used
as the response variable in this modeling exercise than with the age of the predictor variables.
Similar arguments can be made for the use of static air modeling data, and averaged information
from EPA's Toxics Release Inventory.
7.4 Model Validation Issues
The risk index models developed as part of this project may require validation before being used
by childhood lead poisoning prevention programs throughout the country. The following four
issues might be considered by EPA as being important to address as part of this validation
exercise:
1. Within counties and/or census tracts that contribute blood-lead information to the
models, how representative is the screened population of children (on which the
models are based) of the general population of children?
2. Within counties and/or census tracts that do not contribute much information to the
models (e.g., counties with low screening penetration), how well does the model
perform at predicting relative risk and blood-lead distributions?
3. Can risk index models based on historical blood-lead data from 1995 through 2005
accurately predict risk and blood-lead distributions in future years (e.g., can it be used
to forecast towards the federal 2010 goal)?
4. Can the High-Resolution Model developed in Massachusetts be generalized to predict
risk and blood-lead distributions in other states across the Nation (or even within EPA
Region 1)?
If EPA is to provide childhood lead poisoning prevention programs with a risk characterization
tool based on these models, a comprehensive validation should be pursued to address the above
four issues.
Validation of the Surveillance Data
The first issue is related to the quality of the data supporting model development. For example,
if CDC's surveillance data are biased toward inclusion of high-risk children (as shown in the
comparisons to NHANES), the risk index models also will be biased and tend to over-predict
children at high risk. Note that if the bias is consistent among all counties and census tracts (i.e.,
it over-represents high risk children everywhere), the model predictions for the proportion of
children in each blood-lead category likely will be biased, while the ability for risk indices to
differentiate between high- and low- risk areas will be preserved. If the biases occur differently
in different areas, non-trivial adjustments to the model would need to be pursued prior to use by
childhood lead poisoning prevention programs.
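
The distinction between consistent and differential bias can be illustrated with a small sketch. The Python below uses made-up county names and proportions, and assumes (for illustration only) that surveillance bias acts as a multiplicative inflation factor on the true proportion of children above a threshold. It is not part of the models developed in this project.

```python
def rank_areas(proportions):
    """Return area names ordered from highest to lowest proportion."""
    return sorted(proportions, key=proportions.get, reverse=True)


# Hypothetical true proportions of children at or above a blood-lead threshold.
true_props = {"County A": 0.02, "County B": 0.08, "County C": 0.05}

# Assumption: surveillance over-represents high-risk children by the same
# factor everywhere. Predicted proportions are biased, rankings are not.
uniform = {area: p * 1.5 for area, p in true_props.items()}

# Assumption: the inflation factor differs by area. Rankings can change.
factors = {"County A": 5.0, "County B": 1.0, "County C": 1.2}
differential = {area: p * factors[area] for area, p in true_props.items()}

print(rank_areas(true_props))
print(rank_areas(uniform))
print(rank_areas(differential))
```

Under the uniform factor the ordering of areas matches the true ordering even though every proportion is overstated; under the area-specific factors the ordering changes, which is the case that would call for adjustment before use.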
Because the unit of analysis in the development of the Broad-Based National Model is at the
county level, the goal of a validation exercise would be to determine whether the distribution of
children's blood-lead concentrations included in the surveillance data for a sample of
census tracts is representative of the general population of children found within those census
tracts. One possible approach would be to develop a field testing validation survey, in which a
stratified random sample of counties is selected for a short-term outreach campaign in which
eligible children are sampled in a representative manner. Stratification variables to be
considered would be Rural/Suburban/Urban, predicted level of risk from the model, and possibly
levels of socio-economic status. Obviously, development of such a survey would be costly,
difficult to implement, and likely beyond the scope of this project. Alternatively, CDC might be
able to reveal the specific counties that participated in various waves of NHANES - with
comparisons being made in those specific counties. Access to the identification of the specific
counties from which NHANES study subjects were sampled (within the NHANES analysis
dataset) would provide this project with the best foundation to address the serious biases
identified in Section 7.2 and calibrate the model to ensure that it is more reflective of the U.S.
population.
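
As a rough sketch of how the stratified random selection of counties described above might be drawn, the Python below groups counties by stratification variables and samples a fixed number from each stratum. The county records, identifier values, field names, and the equal allocation of two counties per stratum are all hypothetical; an actual survey design would set allocation and strata from cost and precision requirements.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(counties, strata_keys, n_per_stratum, seed=0):
    """Draw up to n_per_stratum counties at random from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for county in counties:
        strata[tuple(county[k] for k in strata_keys)].append(county)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample


# Hypothetical sampling frame: four counties in each of nine strata.
combos = [
    (u, r)
    for u in ("rural", "suburban", "urban")
    for r in ("low", "medium", "high")
    for _ in range(4)
]
counties = [
    {"county_id": f"{i:05d}", "urbanicity": u, "risk": r}
    for i, (u, r) in enumerate(combos)
]

selected = stratified_sample(counties, ("urbanicity", "risk"), n_per_stratum=2)
per_stratum = Counter((c["urbanicity"], c["risk"]) for c in selected)
print(len(selected), dict(per_stratum))
```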
Validation of the Models in Areas with Low Screening Penetration
This second issue relates to the performance of the risk index models in predicting both relative
risk and the number of children in different blood-lead categories in the census tracts that
historically had low screening penetration. Because there are few or no data in these
geographic areas with which to assess the fit of the risk index models, field studies similar to the
one described in the previous section would need to be conducted to address this issue. The
major difference between the two field studies is that the census tracts chosen for this validation
exercise would be tracts in which the screening penetration is low.
A similar approach could be used to conduct this field validation exercise in which a stratified
sample of counties would be identified for the study, and a representative sample of children's
blood-lead levels would be obtained within those census tracts using an intense, brief outreach
effort. The counties again would be chosen using a stratified random sampling approach, to
obtain a sample of tracts that represents a combination of high-, medium- and low-risk areas in
the rural, suburban, and urban environments. This is an area for potential future collaboration
with CDC and perhaps some of their lead poisoning prevention grantees.
Validation of the Models in Predicting Future Blood-Lead Concentrations
This third issue can be addressed to a certain extent using data that are already
available as part of the modeling process. For example, in the national model where data are
available from 1995 through 2005, data for a state or set of states can be removed for one or
more years and the missing data predicted by the model. If all 2005 data were removed, models
would be developed using the data from 1995 through 2004, and then the "future" predictive
ability of those models could be assessed by applying them to the data from 2005.
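
A minimal sketch of this hold-out idea follows. It withholds the final year, fits a model to the earlier years, and scores the prediction for the withheld year. The records are fabricated for illustration, and the ordinary least-squares trend line is only a stand-in for the generalized linear mixed models used in this project.

```python
def split_by_year(records, holdout_year):
    """Separate records before the hold-out year from those in it."""
    train = [r for r in records if r["year"] < holdout_year]
    held_out = [r for r in records if r["year"] == holdout_year]
    return train, held_out


def fit_linear_trend(train):
    """Least-squares line of proportion on year (stand-in for the real model)."""
    n = len(train)
    mean_x = sum(r["year"] for r in train) / n
    mean_y = sum(r["prop"] for r in train) / n
    sxx = sum((r["year"] - mean_x) ** 2 for r in train)
    sxy = sum((r["year"] - mean_x) * (r["prop"] - mean_y) for r in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda year: intercept + slope * year


def mean_abs_error(predict, held_out):
    """Average absolute gap between predicted and observed proportions."""
    return sum(abs(predict(r["year"]) - r["prop"]) for r in held_out) / len(held_out)


# Hypothetical yearly proportions of children above a threshold, 1995-2005.
records = [{"year": y, "prop": 0.10 - 0.005 * (y - 1995)} for y in range(1995, 2006)]

train, held_out = split_by_year(records, holdout_year=2005)
predict = fit_linear_trend(train)
holdout_error = mean_abs_error(predict, held_out)
print(len(train), len(held_out), holdout_error)
```

The same split could be applied to a state or set of states rather than to all areas, by adding a filter on a geographic field before fitting.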
Validation of the Models in Predicting Blood-Lead Concentrations in Other Geographic Areas
The last type of validation involves the determination of synergies (or lack thereof) in prediction
between the Broad-Based National Model and the High-Resolution Model. Conceptually, we
should be able to aggregate the modeling predictions from multiple census tracts within a county
from the High-Resolution Model and match the county-level predictions from the Broad-Based
National Model. Because the National Model and Massachusetts Models were
developed independently, using different data sources for the surveillance data (CDC and
MDPH) and different predictor variables, these synergies may not exist.
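
The aggregation step can be sketched as a child-weighted average of tract-level predicted proportions within each county. The Python below is illustrative only: the tract identifiers, child counts, and predicted proportions are made up, and the field names are hypothetical. It relies on the convention that the first five digits of an 11-digit census tract identifier give the state and county FIPS codes.

```python
from collections import defaultdict


def aggregate_to_county(tract_predictions):
    """Child-weighted average of tract predicted proportions, by county."""
    totals = defaultdict(lambda: [0.0, 0])  # county -> [expected cases, children]
    for tract in tract_predictions:
        county = tract["tract_geoid"][:5]  # state (2 digits) + county (3 digits)
        totals[county][0] += tract["pred_prop"] * tract["n_children"]
        totals[county][1] += tract["n_children"]
    return {c: expected / n for c, (expected, n) in totals.items() if n > 0}


# Hypothetical tract-level predictions from a high-resolution model.
tracts = [
    {"tract_geoid": "25025000100", "n_children": 100, "pred_prop": 0.10},
    {"tract_geoid": "25025000200", "n_children": 300, "pred_prop": 0.02},
    {"tract_geoid": "25017300100", "n_children": 200, "pred_prop": 0.05},
]

county_pred = aggregate_to_county(tracts)
print(county_pred)
```

The resulting county values could then be set beside the county-level predictions from the national model to gauge how closely the two agree.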
Further work on integrating the Broad-Based National Model with the High-Resolution Model
(or multiple high resolution models if EPA is successful at expanding this project to include
multiple additional programs) can be done by fitting these two types of models jointly under the
concept of hierarchical linear modeling. This type of model, while more sophisticated and
computationally intensive, can be developed using specialized software under a Bayesian
formulation fit by Markov chain Monte Carlo (MCMC).
7.5 Other Recommendations for Immediate Future Work
The previous sections within Chapter 7 focus on various important issues related to the
development of models to predict risk of childhood lead poisoning at the geographic level,
including calibration to the nationally representative trends over time observed in NHANES,
assessment of the potential impact of a variety of important biases and other data quality issues,
and various model validation exercises that can be explored. EPA also has been including other
state and local lead poisoning prevention programs as part of the project conference calls in
anticipation of developing additional High-Resolution Models as part of follow-up work to this
project. While these are all worthy tasks to pursue as part of future work, there are some
additional analyses that the project team would recommend pursuing on the Broad-Based
National Model as well as the High-Resolution Model within Massachusetts prior to approval of
this report as a final report. These activities include the following:
• Broad-Based National Model
o CDC grantee relationship managers may have insight into data quality issues
(such as the previously discussed laboratory minimum reporting values, and not
following universal reporting guidelines) for specific geographic areas and
periods of time. Additional scrutiny of these data could be used to improve the
quality of the blood-lead response data that serves as a basis for these models.
The maps and visualization tool should help foster this review of the data.
o In addition to the above data review, further investigation into using urban vs.
rural status as a potential effect modifier in the analyses also is recommended.
Differentiating between urban and rural areas can be conducted in numerous
ways, including:
• Determining whether the county is part of a Metropolitan Statistical Area
within the 2000 US Census
• Identification of the counties that contain the U.S. top 100 (or 200) cities
based on population size
• Use of a population density score (with a cut-off value).
Use of this variable as a potential effect modifier might include fitting separate
intercepts and slopes for the effects of time and seasonality within the different
regions of the country, as well as the potential for using different environmental,
programmatic, and demographic predictor variables in these two area types in the
multivariate predictive models.
o Once the proper way of handling the potential effect modifier for rural versus
urban areas has been determined, the exploratory analyses that assess the predictive ability of each
candidate environmental, programmatic, and demographic variable could be refit
in a manner consistent with the baseline effects that will be included in the model.
Thus, rather than assessing the predictive ability of a candidate variable after
adjusting for the downward trend of time, it should be assessed after adjusting
for region, region*time, region*seasonality, and potentially region*urban/rural.
High-Resolution Model in Massachusetts
• Because Massachusetts followed universal screening and
reporting guidelines during the entire period of observation (2000-2006), and because
these data have been used previously to support federally funded research projects, there
is less concern about some of the previously mentioned data quality issues. This does not
mean that the Massachusetts data are not potentially biased or flawed, as there are still
probable selection biases and potential geocoding biases that were introduced into the
analysis dataset that supports the High-Resolution Model. Our collaborators at the
Massachusetts Department of Public Health are invited to review and comment on this
work, and add their insight and experience in making recommendations on additional
ways of handling the various data sources that were integrated into this model.
• It also is recommended that comparisons be made between the observed and predicted
data from the Broad-Based National Model for counties in Massachusetts (based on the
input data received from CDC) and the observed and predicted data from the
High-Resolution Model (based on the input data received from MDPH), by aggregating the
observed and predicted census tract data within Massachusetts to the county level.
• Finally, pursuit of some additional analyses of the individual-level data from MDPH is
recommended, linking individual blood-lead testing results on children over time to
the housing inspection results (as well as other census-tract level predictors that were
used in the current High-Resolution Model). This will help identify the degree of
information loss experienced by pursuit of the ecological models of aggregate summary
data.
8.0 REFERENCES
24 CFR Part 35; 40 CFR Part 745, Lead; Requirements for Disclosure of Known Lead-Based
Paint and/or Lead-Based Paint Hazards in Housing; Final Rule (3/6/1996). Accessed at
http://www.leadsafehomes.info/pdfs/all titleten fulltext english.pdf#search=%22HUD%201018%20Rule%22
40 CFR Part 745, Lead; Identification of Dangerous Levels of Lead; Final Rule (1/5/2001).
Accessed at http://www.epa.gov/fedrgstr/EPA-TOX/2001/January/Day-05/t84.pdf
Battelle. Draft Quality Management Plan for the Targeting Elevated Blood-lead Levels in
Children Pilot Study. February 2007.
CDC. 1997. Screening Young Children for Lead Poisoning: Guidance for State and Local Public
Health Officials, edited by U.S. Department of Health and Human Services. Atlanta GA:
Public Health Services, CDC.
HUD. 1995. The Relation of Lead Contaminated House Dust and Blood-Lead Levels Among
Urban Children. Washington DC: U.S. Department of Housing and Urban Development.
Lanphear, BP, TD Matte, J Rogers, RP Clickner, B Dietz, RL Bornschein, P Succop, KR
Mahaffey, S Dixon, W Galke, M Rabinowitz, M Farfel, C Rohde, J Schwartz, P Ashley,
and DE Jacobs. 1998. The contribution of lead-contaminated house dust and residential
soil to children's blood-lead levels. A pooled analysis of 12 epidemiologic studies.
Environ Res 79 (1):51-68.
Miranda, ML, DC Dolinoy, and MA Overstreet. 2002. Mapping for Prevention: GIS Models for
Directing Childhood Lead Poisoning Prevention Programs. Environmental Health
Perspectives 110 (9):947-53.
Miranda, ML, JM Silva, MA Overstreet Galeano, JP Brown, DS Campbell, E Coley, CS
Cowan, D Harvell, J Lassiter, JL Parks, and W Sandele. 2005. Building Geographic
Information System Capacity in Local Health Departments: Lessons from a North
Carolina Project. Am J Public Health 95 (12):2180-5.
Spivey, Angela. The Weight of Lead: Effects Add Up in Adults. Environmental Health
Perspectives Volume 115, Number 1, January 2007.
Strauss, Warren, R Carroll, Steve Bortnick, John Menkedick, and B Schultz. 2001a. Combining
Datasets to Predict the Effects of Regulation of Environmental Lead Exposure in Housing
Stock. Biometrics 57:203-210.
Strauss, Warren, Ramzi Nahhas, Leanna House, Amy Kurokawa, and Bradley Skarpness. 2001b.
Development of Models to Predict Risk of Childhood Lead Poisoning at the Census Tract
Level. Columbus OH: Technical Report to the U.S. Centers for Disease Control and
Prevention under Contract No. 200-98-0102.
Strauss, Warren, Tim Pivetz, P Ashley, John Menkedick, E Slone, and S Cameron. 2006.
Evaluation of Lead Hazard Control Treatments in Four Massachusetts Communities
through Analysis of Blood-lead Surveillance Data. Environmental Research 99
(2):214-223.
U.S. Department of Housing and Urban Development. September 15, 1999. Final Rule,
Requirements for Notification, Evaluation and Reduction of Lead-Based Paint Hazards in
Federally Owned Residential Property and Housing Receiving Federal Assistance.
Washington DC: Federal Register, 50140-50231.