United States Environmental Protection Agency    Office of Water (4606)    EPA 816-R-00-020    October 2000
www.epa.gov/safewater

Data Reliability Analysis of the EPA
Safe Drinking Water Information System / Federal Version
(SDWIS/FED)

Acknowledgements
The following people contributed to this project and the preparation of this report:

Jan Auerbach    Project lead
Lee Kyle        Data quality quantification, principal author of this report
Fran Haertel     Data quality characterization, state-specific data quality reports


Members of the Data Reliability Stakeholders Workgroup

EPA Headquarters:
      Jan Auerbach, Chair; Chief, Information Management Branch (IMB)
      Fran Haertel, Environmental Protection Specialist, IMB
      Lee Kyle, Statistician, IMB
      Ken Harmon, Office of Enforcement and Compliance Assurance (OECA)

EPA Regions:
      1: Chris Ryan, Region I
      2: Mark Rasso, Region II SDWIS/FED coordinator
      5: Tom Poleck, Region V SDWIS/FED coordinator
      6: Andy Waite, Region VI SDWIS/STATE coordinator
      8: Aundrey Wilkins; Jack Rychecky, Region VIII Branch Chief
States:
      Florida: Kenna Study, Drinking Water Manager
      Iowa: Dennis Alt, Drinking Water Manager
      Utah: Kevin Brown, Drinking Water Administrator
      Washington: Peggy Johnson, Drinking Water Manager
      Association of State Drinking Water Administrators (ASDWA): Vanessa Leiby,
      Executive Director; Bob Blanco

Industry:
      Association of Metropolitan Water Agencies (AMWA): David Denig-Chakroff
      American Water Works Association (AWWA): Teryl Pajor, Dan Schechter
      National Association of Water Companies: John Hroncich
Other:

      State Lab: Steve Jennis
      Natural Resources Defense Council (NRDC): Eric Olson, Adriana Quintaro

EXECUTIVE SUMMARY

PART I: MANAGEMENT SUMMARY

1   Introduction
    1.1  Background
    1.2  Perspective/context
2   Summary of findings
3   Corrective actions
    3.1  Early actions taken
    3.2  Actions taken resulting from September 1999 Stakeholder Workgroup recommendations
    3.3  Future actions planned
    3.4  Implementation process

PART II: DETAILED FINDINGS

4   National estimates of the quality of SDWIS/FED data
    4.1  Data quality defined
    4.2  Methodology
    4.3  Perspective/context
    4.4  Data verifications
    4.5  Industry surveys
    4.6  Frozen database comparison—Timeliness estimates
    4.7  Comparison of SDWIS/FED to Envirofacts
5   Additional data quality analyses
    5.1  States' reporting of violations data
    5.2  Comparison of states' reporting of Annual Compliance Report (ACR) data to SDWIS/FED data
    5.3  Error reports analysis—data transfer errors
    5.4  State structures analysis
    5.5  State summaries of SDWIS/FED data quality, and recommended improvements

Appendix A—Stakeholders Working Group recommendations

                          EXECUTIVE SUMMARY


In 1998, EPA launched a major effort to assess the quality of its drinking water data (the
data used to assess compliance with the Safe Drinking Water Act) and found that the data
need to be improved. This report is the culmination of that effort.
This report provides specific findings and estimates of the quality of the data that are in,
and should be in, the EPA Safe Drinking Water Information System (SDWIS/FED).
These results in no way should be interpreted as a reflection of drinking water quality,
which overall remains high.
SDWIS/FED is EPA's drinking water database. It contains drinking water data for
approximately 170,000 public water systems serving over 250 million people. For each
water system it holds inventory data describing the system, data on any violations the
system has incurred, and the resulting enforcement actions taken by states and/or EPA to
ensure drinking water protection.
EPA found that the data quality for a selected subset of the required inventory data
elements is high, that the data quality of violations data is low, and that enforcement
actions data are of moderate quality. SDWIS/FED data quality findings were similar
across water system types and size categories.

The violations listed in SDWIS/FED are accurate, but they are incomplete. A number of
states have never reported certain types of violations. While industry found a few cases of
over-reporting in the past (which have been corrected), EPA found very little over-
reporting of violations in its analyses.
EPA and states have taken or scheduled a number of corrective actions, which are
described in the Management Summary. These actions include (but are not limited to)
more and improved training on rule implementation and data entry, additional and
revised data audits, and improved data error interpretation. These corrective actions
should improve the quality of the data reported nationally, as well as improve the public's
understanding of the overall high quality of drinking water supplied by most systems in
the United States.

                   PART I: MANAGEMENT SUMMARY

1   Introduction
In 1998, EPA launched a major effort to assess the quality of its drinking water data (the
data used to assess compliance with the Safe Drinking Water Act) and found that the data
need to be improved. This report provides estimates of the quality of the data that are in,
should be in, the EPA Safe Drinking Water Information System (SDWIS/FED). These
results in no way should be interpreted as a reflection of drinking water quality, which
overall remains high.

SDWIS/FED is EPA's drinking water database. It contains drinking water data for
approximately 170,000 public water systems serving over 250 million people. For each
water system it holds inventory data describing the system, data on any violations the
system has incurred, and the resulting enforcement actions taken by states and/or EPA.
EPA found that the data quality for a subset of 8 required inventory data elements is high,
that the data quality of violations data is low (principally because they are incomplete),
and that enforcement actions data are of moderate data quality.

1.1   Background
In the summer of 1998, some drinking water utility trade associations advised their
members to check the EPA "Envirofacts" website, which contains violations and
enforcement actions information on individual water systems, so that they would be
prepared for possible inquiries from their customers. Some larger utilities found gross
errors in the reporting of violations against their water systems, specifically cases of
"over-reporting": violations listed in SDWIS/FED that never actually occurred. Several of
the utilities met with the incoming Assistant Administrator for the Office of Water, J.
Charles Fox, to voice their concerns over the poor quality of the data that were available
to the public.
Also that summer, EPA was preparing the first Annual Compliance Report (ACR), as
required by the 1996 Amendments to the Safe Drinking Water Act. The Amendments
required states to prepare state reports and EPA to compile them into a national report.
When EPA compared the data in the state reports to the data these states submitted to
SDWIS/FED, it found more than a 30% overall difference in data that should have been
virtually identical. These two concerns led the Assistant Administrator to issue a letter to
the states on September 3, 1998, calling for a major initiative to quantify and
characterize the quality of the data in SDWIS/FED.

EPA began this initiative by holding three national public meetings on SDWIS/FED data
quality in November and December 1998. EPA then formed a data reliability
stakeholders workgroup made up of people from EPA headquarters and regional offices,
state drinking water programs, water utilities, industry associations, laboratories, and an
environmental non-profit organization. The workgroup considered the comments from
the public meetings and helped EPA develop a Data Quality Action Plan.

The Data Quality Action Plan, dated December 31, 1998, consisted of 4 major
components:
1.  Establish a SDWIS/FED data quality goal:

    "SDWIS/FED will contain 100% complete, accurate, timely, and consistent data
    which portray the data submitted by public water systems and primacy agencies,
    consistent with the Safe Drinking Water Act (SDWA) requirements. This goal will be
    advanced through interim milestones, which can be set once the current level of
    SDWIS/FED data quality is determined."
2.  Improve the way SDWIS/FED data are presented in the EPA Envirofacts website.
    Several water utilities and other stakeholders raised concerns about what water
    system compliance information was available on the EPA Envirofacts website, and
    how it was displayed.
3.  Take interim actions to improve SDWIS/FED data quality (the status of these actions
    is discussed in Section 3.1).
4.  Quantify and characterize the quality of SDWIS/FED data.
EPA used several analyses to quantify and characterize data quality (see box below). The
overall SDWIS/FED data quality estimates for inventory, violations, and enforcement
actions data are based primarily on the findings from the data verifications analysis, with
input from an analysis comparing Annual Compliance Report (ACR) data to data in
SDWIS/FED. The data verifications analysis included 29 data verification audits
conducted in 27 states between 1996 and 1998. A total of 1,857 systems were audited
(see Section 4.4 for details).

          Analyses used to quantify and characterize
                   SDWIS/FED data quality
   •   Data verifications—reviews of data in state files that
       provided numerical estimates of overall SDWIS/FED
       data quality
   •   Industry surveys—water system reviews of
       SDWIS/FED data that provided numerical estimates of
       the accuracy of data in SDWIS/FED
   •   Frozen database comparison—to develop numerical
       estimates of the timeliness with which violations are
       reported
   •   Comparison of SDWIS/FED data to Envirofacts data—
       to check for data transmission errors from one data set
       to the other
   •   Comparison of states' reporting of 1997 Annual
       Compliance Report (ACR) data to SDWIS/FED data—
       provided ratios of under-/over-reporting and reasons
       for discrepancies
   •   Errors analysis—to evaluate errors incurred in
       transferring data to SDWIS/FED

Initial SDWIS/FED data quality estimates were shared with states and EPA regions in
summer 1999. Any errors found were checked and corrected. Some of the states'
concerns are discussed in Section 4.4.1.3 of this report. Since then, the other analyses
have been completed, including the industry surveys. The overall findings are presented
in this report.
The outcome of the SDWIS/FED data quality analysis is a benchmark of SDWIS/FED
data quality, and a better understanding of where greater attention needs to be focused to
improve it. Where available, the quantitative portion of this analysis also includes a
water system perspective (the percent of water systems having violations or enforcement
actions). The qualitative portion ties together numerical and non-numerical information
from a number of analyses to further characterize where the problems are occurring, and
why.

1.2   Perspective/context
These results in no way should be interpreted as a reflection of drinking water quality,
which overall remains high. Nor does this analysis question the accuracy of the data submitted by
laboratories or water systems to states (inaccurate lab results, fraud, data falsification,
etc.). The thousands of compliance decisions that are made correctly by state drinking
water programs are not enumerated. Only the violations and enforcement actions appear,
because SDWIS/FED is an exceptions database (in other words, states do not provide
sample data on regulated contaminants; they only report to SDWIS/FED when an
"exception," such as a violation or enforcement action, has occurred). As will be shown,
only a small percentage of systems have any health-based violations. Many states have
taken corrective steps to improve their SDWIS/FED data quality since these data were
gathered.


2   Summary of findings
Summary estimates of SDWIS/FED data quality are presented in Text Box 1. Detailed
estimates are contained in Part II of this report.
Inventory data

   •   The overall quality of 8 core SDWIS/FED inventory parameters is high: only 4%
       of the inventory parameters checked had any discrepancies (discrepancies are
       differences in data, missing data, or errors). The two parameters that change most
       frequently—population served and number of service connections—had the
       highest discrepancy rates. SDWIS/FED data quality estimates are very similar
       across water system types. These results are corroborated by the industry
       surveys. The 8 parameters checked were: system status (active or inactive),
       water system type, primary source of water, population served, number of
       service connections, address, name, and water system ID.

Violations data
   •   The overall quality of SDWIS/FED violations data is moderately high (estimated
       at 68%) for the Total Coliform Rule standard (an acute health-effects measure).
       However, it is very low for other health-based standards (including Chemicals,
       Radionuclides, and the Surface Water Treatment Rule) and for
       monitoring/reporting requirements.
   •   Most of the discrepancies are because of unrecorded and unreported violations.
       This accounts for 56% of all MCL discrepancies, 83% of SWTR TT
       discrepancies,  and 94% of all monitoring/reporting discrepancies. Data flow
       discrepancies (data in state databases but not SDWIS/FED) account for the
       remainder.

   •   The data that are reported in SDWIS/FED are highly accurate overall, in part
       because edit checks reject data which are transferred incorrectly.
   •   Data quality estimates are similar across water system types; this is corroborated
       by the industry surveys.
   •   Very little indication of over-reporting of violations was found (less than 0.7% of
       violation discrepancies).
   •   A number of states have never reported certain types of violations.
   •   Many states are not meeting the 90-day deadline for reporting violations. Only
       68% of violations were reported on time.

   •   Violations reported using the Traditional method (selected data replacement or
       correction) appear to be more timely than those reported using the Total Replace
       method (replacing the entire data set each time changes are made).
Enforcement actions data
   •   SDWIS/FED enforcement actions data were found to be 87% complete and 83%
       accurate. Results were similar across water system types.
Other findings
   •   No discrepancies were found between data in SDWIS/FED and Envirofacts.
   •   "Data entry problems" was the most frequently cited reason for discrepancies
       between ACR data reported by states and SDWIS/FED data. "Resource
       limitations" was the next most common reason for discrepancies.
   •   Using the Traditional data entry method, 20% of inventory data and 32% of
       violations and enforcement actions data are being rejected. It was not possible to
       perform a similar calculation for the Total Replace method.
   •   Only 25% of all states were successful in resubmitting data on their first attempt.
       Of those not successful on the first attempt, 82% of the error types were data
       entry errors. Seven percent or less were SDWIS/FED software limitations and
       problems.
   •   Characteristics of state programs that result in high quality SDWIS/FED data
       include routine, meaningful communication at all levels; annual PWS notification
       of monitoring schedules; and automated monitoring compliance determination.

                                    Text Box 1

                  SDWIS/FED Data Quality Summary Statistics

Inventory data

The SDWIS/FED data quality of the 8 inventory parameters checked is estimated to be 96%.

       Number of data points (~2,000 systems reviewed
       times 8 parameters checked)                          16,006
       Discrepancies:
          Number                                               646
          Percent                                             4.0%
       SDWIS/FED data quality                                  96%

Notes: The number of discrepancies is the number of instances where the DV audit team
concluded that the parameter in SDWIS/FED was incorrect. SDWIS/FED data quality =
% of data without discrepancies (errors). SDWIS/FED inventory data quality by
parameter: system status (active or inactive)—97%, water system type—97%, primary
source of water—98%, population served—91%, # service connections—92%,
address—95%, name—98%, water system ID—100%.

Violations data

The SDWIS/FED data quality of violations data ranges from 7% for Surface Water
Treatment Rule Treatment Technique (SWTR TT) violations to 68% for Total Coliform
Rule Maximum Contaminant Level (TCR MCL) violations. Violations data listed in
SDWIS/FED are accurate, but incomplete. In addition, 68% of violations are reported on
time.

                                   TCR      Total Other    SWTR     Total
                                   MCL         MCL          TT       M/R
       % systems w/ violations     6.1%      < 4.3%        9.6%     < 78%
       Number of violations         162         59           94     5,091
       Discrepancies
          Number                     52         50           87     4,613
          Percent                    32%        85%          93%       91%
       % Completeness                68%        19%          11%       10%
       % Accuracy                    99%        79%          67%       95%
       SDWIS/FED data quality        68%        15%           7%        9%

Notes: The "% systems w/ violations" line is not part of the data quality calculations, but
lends perspective: 78% of the 1,857 systems reviewed had at least 1 violation of any type
during the 1-3 year period of review for contaminants and rules. "Number of violations"
is the number of violations that the DVs determined should have been reported to
SDWIS/FED, whether or not they were. "Discrepancies" are the errors cited in the DVs:
93% were for violations not designated by states as violations; the remaining 7%
occurred between state databases and SDWIS/FED. Completeness: % of violations that
should be in SDWIS/FED that made it in. Accuracy: % of violations in SDWIS/FED that
are correct. SDWIS/FED data quality = % of data without discrepancies (errors).

Enforcement actions data

The SDWIS/FED data quality of formal enforcement actions is estimated to be 72%. All
formal enforcement actions, which are issued by the state and/or EPA in response to
violations, are required to be reported to SDWIS/FED.

       % Completeness               87%
       % Accuracy                   83%
       SDWIS/FED data quality       72%

Notes: Discrepancies are the errors cited in the DVs. The DV audits only measure the
difference between state databases and SDWIS/FED; auditors did not assume there
should be an enforcement action unless the state actually took one. Completeness: % of
enforcement actions in state files that made it into SDWIS/FED. Accuracy: % of
enforcement actions in SDWIS/FED that are correct. SDWIS/FED data quality = % of
data without discrepancies (errors).

3  Corrective actions
Text Box 2 defines the 4 elements of SDWIS/FED data quality and correlates their
improvement to the corrective actions discussed below.

3.1   Early actions taken
Before waiting for the results of the analyses which would quantify and qualify the
quality of the data in SDWIS/FED, the data reliability stakeholders workgroup
recommended in December 1998, and EPA subsequently completed, several actions to
improve SDWIS/FED data quality in the interim.

EPA HQ:
•  Improved the way SDWIS/FED data are presented in the EPA Envirofacts website.
   In response to concerns about the quality of older SDWIS/FED data, now only
   violations and enforcement actions incurred since 1993 will be displayed in
   Envirofacts. Beginning in 2003, ten years' worth of data will be displayed.

   EPA also improved the way SDWIS/FED data in Envirofacts are displayed. Major
   changes included combining violations and enforcement actions so that they are
   displayed in the same table (previously, users had to match violations and
   enforcement actions by looking at two different tables and matching the violations
   identification number), showing health-based violations separated from monitoring
   and other violations, and adding links to utilities' Consumer Confidence Reports
   (CCRs) on-line using EPA's new CCR catalog. Better descriptions of what violations
   and enforcement actions are, as well as additional links to state pages and
   contaminant fact sheets, were also added.
•  Prioritized and corrected deficiencies already identified in the data entry process
•  Accelerated the development and implementation of SDWIS/STATE
•  Provided additional error check routines in SDWIS/FED
•  Improved existing data entry tools such as the data entry troubleshooter's guide
•  Accelerated efforts to develop new tools to simplify data retrieval, and accelerated
   efforts to improve existing reporting tools
•  Developed an interim mechanism to enable utilities to confirm their data before they
   are officially accepted in SDWIS/FED
EPA Regions took additional steps to ensure that quarterly submissions are reviewed and
errors are checked prior to the quarterly freeze in SDWIS/FED.
EPA and States drafted quality assurance manuals to help states and regions operate the
drinking water program and report drinking water information.

3.2  Actions taken resulting from September 1999 Stakeholder Workgroup
     recommendations
The Stakeholder Workgroup reviewed the preliminary findings of the analyses used to
quantify and characterize SDWIS/FED data quality in September 1999. Many of the

actions taken or scheduled that are listed below resulted from the workgroup's
prioritized recommendations, which are listed in Appendix A.

3.2.1   EPA HQ actions taken
•  Training:
   EPA staff have designed implementation and data reporting training courses for the
   Lead and Copper Rule Minor Revisions (LCRMR) and the Public Notification Rule
   (PN). Several courses have been conducted for states and regions.
   EPA has established a contractual arrangement for states and regions to obtain one-
   on-one, on-site data management assistance.
   EPA has expanded its offering of generic data entry and troubleshooting (i.e.,
   correcting errors) training courses.
•  The SDWIS/FED Edit/Update Summary Report has been completely redesigned to
   fully account for and document the processing results of each data submission file.

3.3  Future actions planned

3.3.1   EPA HQ Actions
•  Provide additional training by:
   Developing a schedule for implementation and reporting training courses for the
   Chemicals/Radionuclides rules, the Surface Water Treatment Rule,  the Total
   Coliform Rule, and developing training courses and materials for each new rule. The
   training will include implementation, compliance determination and reporting
   requirements.
•  Improve the data verifications audits by:
   Revising the Data Verification Protocol to incorporate workgroup recommendations,
   completing a version of the Data Verification Protocol for states to use in conducting
   a self-audit, and completing 11 data  verification audits in FY2000 (more if funds
   allow). If 17 audits were conducted per year, the data quality in each state could be
   assessed every 3 years (audits cost roughly $25,000 each).
•  Complete a version of the error report which managers can use to help them improve
   data entry.
•  Target attention to some states and regions, based on the results of individual state
   analyses and ongoing data verification audits. EPA will conduct meetings to address
   issues, target technical assistance and develop plans of action with such states and
   regions.
•  Continue to calculate SDWIS/FED data quality, including: national estimates of
   SDWIS/FED data quality at least every 3 years, or more frequently if data from a
   sufficient number of data verifications analyses are available; the ACR vs.
   SDWIS/FED analysis, national estimates of the timeliness of violations data
   reporting, and the number of states reporting violations by contaminant/rule and
   water system type, annually; and error rates by error code, quarterly.

3.3.2  EPA Regional Actions
•  Conduct the errors analyses quarterly to determine which error conditions are
   occurring most frequently.

3.3.3  State Actions
States may take the following actions to improve data quality, but specific actions in each
state will be contingent on its particular situation.
•  Notify utilities annually of compliance monitoring schedules
•  Implement and participate through Association of State Drinking Water
   Administrators (ASDWA) in peer reviews among states
•  Conduct self-audits using the revised data verifications protocol
•  Share software, tracking systems, and compliance determination modules among
   states that support rule implementation
•  Evaluate current information management systems and consider adopting
   SDWIS/STATE
•  Participate in EPA-provided training for rule implementation, reporting requirements,
   and data entry
•  Develop and implement a quality assurance program
3.3.4  Joint EPA-State Actions
•  Work together to establish goals for improving SDWIS/FED data quality at the
   national level, as assessed through data verifications results.

       Potential categories for SDWIS/FED data quality goals:
       •  Overall inventory
       •  Overall enforcement actions
       •  Violations: TCR MCL, Other MCL, SWTR TT, LCR TT, M/R

•  Continue early involvement of states and regions in rulemaking with a focus on (1)
   streamlining reporting requirements and (2) simplifying rules to ease interpretation
   and implementation, including reporting requirements.

3.4  Implementation process
The ASDWA/EPA Data Management Steering Committee (DMSC), in conjunction with
the Data Sharing/Data Quality Committee (DSC), will continue to focus on data quality
improvement issues identified in this report, and will propose future corrective actions
and strategies for EPA and States.
Individual state-specific recommendations will be communicated to the states and EPA
Regions through State Summary reports. Joint discussions will be conducted and an
implementation schedule developed. Follow-up activities will be conducted through the
normal mid-year and end-of-year program evaluation process. Generic state corrective
actions will be pursued through the State/EPA annual Workplan process.

Formal implementation could begin as early as FY 2001. Many states have already begun
state-specific corrective actions, as has EPA. Once finalized, appropriate standard
operating procedures will be developed and incorporated in the EPA PWSS Data
Management Quality Assurance Manual.
Collectively, steps already taken by EPA and States, and those planned, are expected to
significantly improve the quality of data in SDWIS/FED. These steps should also
improve public understanding of the high quality of drinking water supplied to consumers
by most water systems in the United States.
                                    Text Box 2
             Improving the 4 Elements of SDWIS/FED Data Quality

There are 4 major elements of data quality:
1. Completeness—what percent of data that should be in SDWIS/FED is there?
2. Accuracy—how accurate are the data in SDWIS/FED?
3. Timeliness—what percent of violations data are being reported within a quarter after
   their compliance period end dates?
4. Consistency—are the regulations being interpreted consistently?

Actions taken or planned should improve these elements of SDWIS/FED data quality as
follows. (In the original report, a matrix marks which of the 4 elements each action
below improves; the individual matrix markings could not be reproduced here.)

Early actions taken by EPA HQ
•  Improved the way data are presented in the EPA Envirofacts website
•  Corrected deficiencies in the data entry process
•  Accelerated the development and implementation of SDWIS/STATE
•  Provided additional error check routines in SDWIS/FED
•  Improved existing data entry tools such as the data entry troubleshooter's guide
•  Accelerated efforts to develop new tools to simplify data retrieval
•  Developed interim mechanism to enable utilities to confirm their data before they are
   officially accepted in SDWIS/FED

EPA HQ actions taken resulting from September 1999 Stakeholder Workgroup
recommendations
•  Designed implementation and data reporting training classes for the LCRMR and PN
   Rules
•  Established arrangement for states and regions to obtain one-on-one, on-site data
   management assistance
•  Expanded offering of generic data entry and troubleshooting training courses
•  Redesigned SDWIS/FED Edit/Update Summary Report

EPA HQ actions planned
•  Provide additional rule-specific training for existing and upcoming rules, including
   implementation, compliance determination and reporting requirements
•  Improve the data verifications audits to enable states to conduct self-audits; perform
   additional audits
•  Complete a version of the error report which managers can use to help them improve
   data entry
•  Target poorer-performing states and regions; conduct meetings to discuss issues,
   target technical assistance and develop plans of action with such states and regions
•  Continue to quantify SDWIS/FED data quality (benchmark DQ)

EPA Regional actions planned
•  Conduct the errors analysis quarterly to determine which data entry error conditions
   are occurring most frequently

State actions planned
•  Notify utilities annually of compliance monitoring schedules (reduce M/R violations)
•  Implement and participate through ASDWA in peer reviews among states
•  Conduct self-audits using the revised data verification protocol
•  Share software, tracking systems, and compliance determination modules among
   states that support rule implementation
•  Evaluate current information systems and consider adopting SDWIS/STATE
•  Participate in EPA-provided training for rule implementation, reporting requirements,
   and data entry
•  Develop and implement a quality assurance program

Joint EPA-State actions planned
•  Work together to establish goals for improving SDWIS/FED data quality, for specific
   categories of data, at the national level
•  Continue early involvement of states and regions in rulemaking with a focus on (1)
   streamlining reporting requirements and (2) simplifying rules to ease interpretation
   and implementation, including reporting requirements
                     PART II: DETAILED FINDINGS

4   National estimates of the quality of SDWIS/FED data
Part II provides details of the analyses conducted and the estimates of SDWIS/FED data
quality. This section provides a definition of data quality, describes the analytical
methodology used, and provides detailed estimates of the quality of inventory, violations
and enforcement actions data in SDWIS/FED.

4.1  Data quality defined
Two questions need to be answered in order to estimate the quality of SDWIS/FED data:
    1. What should be in SDWIS/FED (and is missing)?
    2. How accurate is what is in SDWIS/FED?

There are four major elements of data quality. The first two are essentially variations on
the two questions above:
    •  Completeness—what percent of data that should be in SDWIS/FED is there?

    •  Accuracy—how accurate are the data in SDWIS/FED?

There are two additional elements of data quality:

    •  Timeliness—what percent of violations data are being reported within a quarter
       after their compliance period end dates? Timeliness is a component of
       Completeness.
    •  Consistency—are the regulations being interpreted consistently?


4.2  Methodology

4.2.1  How EPA quantified data quality
This quantification is based on discrepancy rates for inventory, violations and
enforcement action  data. Discrepancy rates are defined as the percent of data that should
be in SDWIS/FED that have errors, are missing, or that do not match between state
databases and SDWIS/FED.
Overall data quality  (for inventory, violations and enforcement actions data) is defined
as the percent of data with no discrepancies. If, for example, 20% of the data have
discrepancies, the SDWIS/FED data quality is 80%.
For violations and enforcement actions data, overall data quality can also be defined as
the product of Completeness and Accuracy. (Because inventory data are not exceptions-
based data, they are not quantified this way; they are quantified as a single number.)
Accuracy is conditional on Completeness: it measures the accuracy of only those data
that actually made it into SDWIS/FED.
For example, if there are:
       100 violations that should be in SDWIS/FED,
       and 60 make it in (Completeness = 60%),
       and, of those, 48 are accurate (Accuracy = 80%), then
       Overall quality = 60/100 × 48/60 = 48%
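
The arithmetic can be sketched in a few lines of Python. This is an illustrative
sketch only (the function and its inputs are ours, not part of EPA's analysis), but
it reproduces the example above.

    # Illustrative sketch of the data quality arithmetic described above.
    # The counts mirror the worked example; they are not real SDWIS/FED figures.
    def data_quality(should_be_reported: int, reported: int, accurate: int) -> dict:
        """Completeness, Accuracy, and overall quality, as whole percentages."""
        completeness = reported / should_be_reported  # fraction that made it in
        accuracy = accurate / reported                # conditional on being reported
        overall = completeness * accuracy             # product, per the definition above
        return {
            "completeness": round(100 * completeness),
            "accuracy": round(100 * accuracy),
            "overall_quality": round(100 * overall),
        }

    print(data_quality(should_be_reported=100, reported=60, accurate=48))
    # -> {'completeness': 60, 'accuracy': 80, 'overall_quality': 48}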
Timeliness is a component of Completeness and is included in the Completeness
calculations; it was quantified separately in the frozen database analysis. Consistency is
not quantified in this analysis, but is implicit, to some degree, in the data verifications.

4.2.2   Which estimates for which data
Next to each data type (in bold) is a list of parameters for which EPA calculated data
quality estimates; below each are the analyses used to generate these estimates.
                      SDWIS/FED data quality estimates

          Inventory—core data elements:

                  1. status (i.e., water system is active or inactive), 2. type of public water
                  system (e.g., community, transient), 3. primary source of water, 4.
                  population served, 5. number of service connections, 6. address, 7. name,
                  8. PWS ID

                  Overall SDWIS/FED data quality:
                  •   Data verifications analysis
                  •   Industry surveys
          Violations—all violations

                  Overall SDWIS/FED data quality:
                  •   Data verifications analysis
                  Completeness:
                  •   Data verifications analysis
                  Accuracy:
                  •   Data verifications analysis (with input from the Annual Compliance
                     Report vs. SDWIS/FED analysis)
                  •   Industry surveys
                  Timeliness:
                  •   Frozen database comparison
          Enforcement actions—all required to be reported to SDWIS/FED

                  Overall SDWIS/FED data quality:
                  •   Data verifications analysis
                  Completeness:
                  •   Data verifications analysis
                 Accuracy:
                  •   Data verifications analysis
                  •   Industry surveys
4.3   Perspective/context
These results in no way should be interpreted as a reflection of drinking water quality,
which overall remains high. Nor does this analysis question the accuracy of the data submitted by
laboratories or water systems to states (inaccurate lab results, fraud, data falsification,
etc.). The thousands of compliance decisions that are made correctly by state drinking
water administrators are not enumerated. Only the violations and enforcement actions
appear, because SDWIS/FED is an exceptions database (in other words, states do not
provide sample data on regulated contaminants; they only report to SDWIS/FED when an
"exception," such as a violation or enforcement action, has occurred). As will be shown,
only a small percentage of systems have any health-based violations. Many states have
taken corrective steps to improve their SDWIS/FED data quality since these data were
gathered.

4.4   Data verifications

4.4.1  Background
The data verifications analysis is the only analysis that assesses the first key component
of data quality for violations and enforcement actions data: Completeness, or the
percentages of these data which should be in SDWIS/FED that are. The data verifications
analysis also yields overall SDWIS/FED data quality estimates for inventory data (as do
the industry surveys).

The purpose of data verification audits is to determine whether a state is in compliance
with that state's primacy agreement (since late 1996, auditors have considered guidance
from Regions in addition to Federal regulations). Recommendations contained in the
audit are intended to assist states in correcting deficiencies in their program and improve
SDWIS/FED data quality.

An independent contractor has been performing data verifications  since 1991. The
contractor selects a (semi-) random sample of each type of water system in the state.
During an audit, auditors primarily look at state files and database(s). The results are
intended to be representative of the quality of drinking water data throughout the state
with at least an 80% confidence level, and a 7.5% margin of error.
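
As a rough, illustrative check of what such a design implies, the standard
sample-size formula for estimating a proportion can be applied. This is an
assumption on our part (the contractor's actual sampling design is not described
here), but it gives a feel for the numbers.

    # Illustrative only: approximate sample size implied by an 80% confidence
    # level and a 7.5% margin of error, using the standard formula for a
    # proportion with worst-case p = 0.5.
    import math
    from statistics import NormalDist

    def sample_size(confidence: float, margin: float, p: float = 0.5) -> int:
        """Minimum n so a proportion estimate meets the margin of error."""
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
        return math.ceil(z * z * p * (1 - p) / (margin * margin))

    print(sample_size(0.80, 0.075))  # -> 73

For what it is worth, 73 systems per state is in the same range as the audits'
actual average of roughly 69 systems per state (1,857 systems across 27 states).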

States have the opportunity to review the draft report and provide appropriate
documentation required to adjust or revise the final report. Most states have accepted the
final results of their data verification audits.

Prior to this analysis, data verification reports tabulated the number of systems having
discrepancies. For this analysis, EPA tasked the contractor to re-tabulate the data on a
data point basis—as a true SDWIS/FED data audit. That is, they compared data that
should have been reported to SDWIS/FED to those that actually were reported, and cited
reasons for each discrepancy. Now all data verifications are tabulated in this way.

For this analysis, EPA selected all data verifications done between 1996 and 1998. This
included 29 data verification audits from 27 states. A total of 1,857 systems were audited.
Some of the data verifications focused only on specific rules/contaminants. Results from
the portion of the audit associated with the Lead and Copper Rule  (LCR) are not included
in this analysis due to questions of regulatory interpretation,  which have not yet been
resolved.
4.4.1.1   States included in this analysis, by EPA Region:

        I      II     III    IV     V      VI     VII    VIII   IX     X
        CT     VI     DE     AL     MI     LA     IA     SD     AZ     WA
        MA            MD     FL     MN     NM     NE     WY
        ME            PA*    GA            OK
        NH            WV     NC            TX
        RI*
        VT

        * 2 audits were performed; VI = Virgin Islands
4.4.1.2   Period of review for states reviewed during 1996-1998

       Total Coliform Rule (TCR)       Most recent four quarters in SDWIS/FED
       Nitrates                        Most recent three calendar years
       Nitrites                        1993-1995
       IOCs                            1993-1995; back to 1990 if grandfathered
       VOCs                            1993-1996; back to 1988 if grandfathered
       SOCs                            1993-1995; back to 1990 if grandfathered
       Radionuclides                   Most recent two samples
       Total Trihalomethanes           Most recent four quarters available in SDWIS/FED
       Surface Water Treatment Rule    Most recent four quarters available in SDWIS/FED
       Enforcement                     Time period applicable to related violation
4.4.1.3   Summary of some states' concerns about using data verifications results to
         quantify SDWIS/FED data quality

After EPA calculated SDWIS/FED data quality based on the data verifications, it shared
the draft results with the states. Many states accepted the findings, and the methods used
to derive them.

•  One of the states' most widespread concerns was that the public would misconstrue
   the quality estimates as an indication of how well states are running their drinking
   water programs, or as a measure of their drinking water quality. They felt a more
   accurate picture of state data quality would consider all the decisions a state is
   required to make, not just violation decisions. For example, a state may correctly
   determine that a utility monitored properly in eight out of ten instances, but issue
   only one of the two violations that should have been issued for the other two. States
   noted that data quality in this case was really 90% (nine of the ten required
   decisions, the eight appropriate monitoring determinations plus one of the two
   failure-to-monitor violations, were made correctly). In this analysis, only violation
   opportunities are considered. Since there were two violation opportunities and one of
   them was missed, this analysis would calculate SDWIS/FED data quality in this
   instance as 50%.
•  A number of states pointed out that one improper determination could turn into
   multiple deficiencies. For example, if a system, due to being mis-categorized as a
   smaller system, collects 1 coliform sample per month instead of the 2 required, the
   data verifications will list a dozen violation discrepancies for the year. However, this
   is at least partially balanced by the fact that 1 missing sample is counted as only 1
   M/R discrepancy, even though some sample bottles are used for several
   contaminants; a single missed Synthetic Organic Chemicals (SOC) sample, for
   example, could otherwise result in up to 30 M/R violations.
•   A few data verifications were targeted to states having known data quality concerns.
    Some state reports are therefore better characterized as the "worst case" scenario.
•   Some Federal requirements had just become effective in the time frame covered by
    the audits and many states were still in the process of adopting state rules and
    developing state data systems. Some of the data discrepancies are a function of
    normal and expected "start-up" problems. Some states felt that a snapshot taken today
    is likely to show a much better picture than one taken 3 years ago because many
    states have made data quality improvements since, and resulting from, their audits.
    Data verifications conducted in 1999 and after no longer review the older
    compliance periods; their results will therefore be compared to the results in this
    analysis to measure the improvements suggested here.

•   A few states contest their  initial data verification audits. Some states believe that the
    data verification review team overlooked existing data (particularly monitoring
    results) and incorrectly determined that a violation had occurred when it had not. A
    number of states have pointed out errors in the data verifications findings which have
    since been investigated and corrected, and are reflected in this report.
Despite these concerns, EPA believes the findings are representative of SDWIS/FED data
quality at the national level. Even slight biases (some  of which tend to cancel each other
out) do not significantly change the overall findings.

4.4.2  Confidence in findings
This is not a scientific survey and therefore statistical confidence intervals are not
included for most of the point estimates.  However, EPA is confident that the findings
represent the quality of SDWIS/FED data at the national level.

First, the data verifications audits are designed to be representative of the quality of
drinking water data throughout the state with at least an 80% confidence level and a 7.5%
margin of error. In addition, the audits have undergone scrutiny: in the summer of 1999,
states and regions had an opportunity to review the findings of their audits, and any errors
found were corrected.

Second, EPA considers the summation of the 29 audits in 27 states to be representative of
the quality of drinking water data at the national level. This was ascertained after EPA
modeled the individual state findings mathematically  using Bayesian statistics; the
resulting probability curve was found to have a normal distribution.
Third, EPA looked at data quality from many perspectives, and has compared estimates
with the results of other analyses wherever possible. As will be discussed, findings from
other analyses corroborated the data verifications findings.

4.4.3  State Annual Compliance  Report (ACR) vs.  SDWIS/FED
The data verifications analysis has a category for violations discrepancies between state
databases and SDWIS/FED. However, it does not indicate what portion of these
discrepancies represent under-reporting (data that are in state databases but not
SDWIS/FED) and over-reporting (data that are in SDWIS/FED but not state databases).
It is necessary to make this distinction in order to yield estimates of Completeness and
Accuracy.
To accomplish this, EPA compared calendar year 1997 ACR data reported using state
databases to 1997 data in SDWIS/FED. EPA calculated ratios of the magnitude of under-
reporting to over-reporting for Chemical, Total Coliform Rule (TCR), and Surface Water
Treatment Rule (SWTR) health-based violations and monitoring/reporting violations.
These ratios were input to the data verifications analysis to enable EPA to calculate
estimates for Completeness and Accuracy for violations data.
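
The report does not give the exact formula, but the role of these ratios can be
illustrated with a small, hypothetical sketch: given the discrepancies a data
verification finds between a state database and SDWIS/FED, an ACR-derived
under-to-over-reporting ratio can apportion them, after which Completeness and
Accuracy follow. Every name and number below is an illustrative assumption, not
EPA's documented procedure.

    # Hypothetical sketch: apportion data-flow discrepancies using an
    # ACR-derived under:over reporting ratio, then derive Completeness
    # and Accuracy. Illustrative assumptions only.
    def split(dataflow_discrepancies: int, under_to_over: float):
        """Split state-vs-SDWIS/FED discrepancies into under- and over-reporting."""
        under = dataflow_discrepancies * under_to_over / (1.0 + under_to_over)
        return under, dataflow_discrepancies - under

    # Hypothetical inputs: 200 violations belong in SDWIS/FED; 120 were never
    # designated by the state; 40 discrepancies lie between the state database
    # and SDWIS/FED, with a 9:1 under:over ratio from the ACR comparison.
    should_be_in, never_designated, dataflow = 200, 120, 40
    under, over = split(dataflow, under_to_over=9.0)

    made_it_in = should_be_in - never_designated - under  # correct rows in SDWIS/FED
    reported = made_it_in + over                          # all rows in SDWIS/FED
    completeness = made_it_in / should_be_in              # 44/200 = 22%
    accuracy = made_it_in / reported                      # 44/48 ~ 92%
    print(f"Completeness {completeness:.0%}, Accuracy {accuracy:.0%}")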

4.4.4  Inventory data
4.4.4.1   Estimates by parameter, and overall
Four percent of the data points checked for the 8 required inventory parameters had
discrepancies, or errors. In other words, the overall SDWIS/FED inventory data quality
is estimated to be 96%, as shown below.

                    Status    Water   Primary   Popula-  # Service
                   (active/  system   source     tion    connec-   Address   Name   PWS ID  Overall
                   inactive)  type    of water   served   tions
 Number of systems
 reviewed            2,032    2,014    1,997     1,996     1,996     1,979   1,996   1,996   16,006
 Discrepancies:
   Number               58       61       39       184       161        99      41       3      646
   Percent            2.9%     3.0%     2.0%      9.2%      8.1%      5.0%    2.1%    0.2%     4.0%
 SDWIS/FED data
 quality               97%      97%      98%       91%       92%       95%     98%    100%      96%
Each water system has 1 chance for a discrepancy for each parameter reviewed. The
"Overall quality" column uses the sum of water systems reviewed for each parameter,
which represents the total opportunities for a discrepancy.
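
For instance, the Overall column follows directly from the totals above:

       Overall discrepancy rate = 646 / 16,006 ≈ 4.0%
       Overall SDWIS/FED data quality = 100% - 4.0% = 96%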
The population served and # service connections parameters had the most discrepancies.
A discrepancy in either of these categories is counted as such if the difference is greater
than 10%. Under several drinking water rules, the number of samples required to be taken
is based on the population served and therefore its accuracy is important.
4.4.4.2   Reasons for discrepancies

About one-half of the discrepancies were due to file inconsistencies between data in state
files and the state database(s); another one-third were due to inconsistencies between data
in state database(s) and SDWIS/FED; most of the remaining one-sixth were due to late
reporting, or no data found in state files.
4.4.4.3   Estimates by system type and size
Results from the data verifications analysis were very similar across system types, as
shown below. None of the quality estimates for the 8 parameters listed above differed by
more than 4%.
       CWS        97%
       NTNCWS    96%
       TNCWS     95%
Unfortunately, the results of the data verifications analysis cannot be categorized by
system size. The only approximation available from these data is to use system type as a
proxy for system size. The figures below list the average population served by system
type (from the 98Q4 frozen database, which was frozen in January 1999).

       CWS       4,645
       NTNCWS      308
       TNCWS       175

The average system size for NTNCWSs and TNCWSs is in the Very Small size category
(25-500 population served), and for CWSs the Medium size category (3,301-10,000). If
these results can serve as a proxy for system size, then it appears that data quality may be
similar across size categories. The industry surveys, discussed later, provide a direct
measure of SDWIS/FED inventory data quality by system size; this report addresses that
issue in Section 4.5.3.3.

4.4.5  Violations data
4.4.5.1  Estimates by violation type
Listed below are SDWIS/FED data quality estimates for violations data. The first line of
the table shows the percent of systems (by violation type) having any violations.

Less than 10.4% of all systems audited in the data verifications had any Maximum
Contaminant Level (MCL) violation, and less than 10% of the surface water systems
audited had Surface Water Treatment Rule (SWTR) Treatment Technique (TT)
violations. The estimate that slightly less than 78% of systems had M/R violations is
based on the finding that 78% of all systems audited had at least one violation of any
type, and M/R violations account for 94% of all violations. The 78% estimate also
includes the small number of systems which only had LCR violations (earlier versions of
the analysis included estimates for LCR, and it was not possible to subsequently remove
LCR from this statistic).
These percentages of systems having violations lend a systems perspective. They are  not
part of the calculations of SDWIS/FED data quality, which is based on a data point
perspective. The remainder  of the table reflects a data point perspective.
                                TCR     Total Other   Total    SWTR    Total
                                MCL        MCL         MCL      TT      M/R
       % systems w/
       violations               6.1%     < 4.3%     < 10.4%    9.6%    < 78%
       Number of Violations      162        59          221      94    5,090
       Discrepancies
        Number                    52        50          102      87    4,613
        Percent                   32%       85%          46%     93%      91%
       % Completeness             68%       19%          55%     11%      10%
       % Accuracy                 99%       79%          97%     67%      95%
       SDWIS/FED data
       quality                    68%       15%          54%      7%       9%

Legend:
•   TCR: Total Coliform Rule, applicable to all water systems. Coliforms pose an acute
    health risk
•   MCL: Maximum Contaminant Level violation
•   TT: Treatment Technique violation (MCLs and TTs are health-based violations)
•   M/R: Monitoring/Reporting violation
TCR MCL data will serve as an example to describe this table:

•   6.1% of systems reviewed incurred, or should have incurred, a total of 162 TCR
    MCL violations
•   Of the 162 violations, there were 52 discrepancies, or errors. The discrepancy rate is
    32%, and the corresponding SDWIS/FED data quality estimate is 68% (100%-32%).
•   Completeness and Accuracy—68% of the violations that should be reported in
    SDWIS/FED made it in, and of the violations in SDWIS/FED, 99% are accurate.

According to these estimates, roughly 2/3 (68%) of all TCR MCL violations were
reported completely and accurately. The SDWIS/FED data quality is 15% for Other
MCLs, 7% for SWTR TTs, and 9% for M/R violations.
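
In equation form, using the TCR MCL column as the example:

       SDWIS/FED data quality = 1 - 52/162 ≈ 68%

which agrees, after rounding, with Completeness × Accuracy ≈ 0.68 × 0.99.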

Overall, the data that do make it into SDWIS/FED are accurate. In fact, 99% of the TCR
MCL violations, 79% of Other MCL violations, and 95% of M/R violations listed in
SDWIS/FED are accurate. However, only 2/3 of SWTR TT violations listed in
SDWIS/FED are estimated to be accurate. In other words, there may be some over-
reporting of SWTR TTs in SDWIS/FED.

The weak link in data quality is the large number of violations that never make it to
SDWIS/FED (as estimated by Completeness). Only 1 out of every 9 SWTR TT violations
that should be in SDWIS/FED makes it in (11% Completeness), and only 1 out of every
10 M/R violations makes it in.

4.4.5.2  Reasons for discrepancies

The data verifications include several categories, or reasons, for violations discrepancies.

        Reason                                      TCR    Other   Total   SWTR
                                                    MCL     MCL     MCL     TT      M/R
        Not in state database                         0       0       0      0      166
        No data found in state files                  0       0       0      0    3,492
        Insufficient samples                          0       0       0      0      205
        Different implementation policies            31      22      53     72      417
        Other                                         4       0       4      0       56
        In state database(s) but not SDWIS/FED       16      26      42     11      252
        In SDWIS/FED but not state database(s)        1       2       3      4       25
        Total                                        52      50     102     87    4,613
M/R violations discrepancies account for the majority (94%) of all the discrepancies,
with the largest category being "no data found in state files." This category applies to
M/R violations only. If, for example, required sample results could not be found in any
state files, a discrepancy would be cited if the state did not issue a M/R violation. This
could also occur if water systems were told they could reduce the monitoring frequency
for some requirements, but no record of a waiver having been issued was found.
The "Different implementation policies" category means that the state did not determine
compliance in accordance with their state primacy agreement. Since late 1996, auditors
have also been factoring in any additional guidance provided by EPA Regional offices.
Thus, as long as a state acts in accordance with its own EPA-approved regulations, or
formal interpretive guidance issued by the region, no discrepancy is issued.
The last category listed, "In SDWIS/FED but not state database(s)," represents over-
reporting. All the other categories represent under-reporting. Overall, 99.3% of all
violation discrepancies found in the data verifications analysis are estimated to be from
under-reporting. Only 32 out of the 4,802 violation discrepancies found (<0.7%) are
estimated to be from over-reporting. These estimates are based in part on ratios of under-
to over-reporting identified from the ACR analysis.
Most violations discrepancies are related to compliance determination at the state level;
these are violations which never made it into state databases. The remaining
discrepancies (i.e., those listed in the last two rows of the above table) are related to data
flow between state files and SDWIS/FED. However, since monitoring/reporting
discrepancies comprise 94% of the total number of discrepancies, an overall breakdown
would be dominated by them. A more precise picture is portrayed when the discrepancy
categories are analyzed by violation type:

       Breakdown of               TCR    Other   Total   SWTR
       discrepancies              MCL     MCL     MCL     TT     M/R
       Compliance determination   67%     44%     56%     83%    94%
       Data flow                  33%     56%     44%     17%     6%
As shown in the table above, one-third of all TCR MCL violation discrepancies (17/52)
occur between state files and SDWIS/FED. Over one-half of Other MCL discrepancies
(28/50), 44% of all MCL discrepancies combined (45/102), about one-sixth of SWTR
TT discrepancies (15/87), and only 6% of all monitoring/reporting violation
discrepancies (277/4,613) occur between state files and SDWIS/FED. Other
analyses will look at some reasons for these data flow discrepancies, including the frozen
database analysis, which looks at Timeliness (were some violations merely entered late?),
and the errors analysis (were some violations rejected at data entry?).
4.4.5.3  Estimates by rule/contaminant
The data verifications also listed violations data by rule/contaminant. There were not
sufficient data points to calculate quality estimates for some of the Chemical MCLs, nor
to calculate estimates of Completeness and Accuracy by rule/contaminant. Again, a
systems perspective, listing the percentage of systems having any violations,  precedes the
SDWIS/FED data quality estimates.
        MCL violations (TCR through Rads) and TT violations (SWTR)

                                   TCR    IOCs  Nitrate  Nitrite   SOCs   VOCs  TTHMs   Rads    SWTR
        # systems reviewed       1,857   1,025   1,489    1,489   1,025  1,026     83    523     395
        # systems w/ violations    113      10      19        3       0      2      1      2      38
        % systems w/ violations   6.1%    1.0%    1.3%     0.2%    0.0%   0.2%   1.2%   0.4%    9.6%
        # Violations               162      12      37        4       0      2      1      3      94
        # Discrepancies             52      11      32        3       0      1      0      3      87
        % Discrepancies            32%     92%     86%      75%      0%    50%     0%   100%   94.7%
        SDWIS/FED data quality     68%      8%     14%      25%       *      *      *      *      5%

        M/R violations

                                   TCR    IOCs  Nitrate  Nitrite   SOCs   VOCs  TTHMs   Rads    SWTR
        # systems reviewed       1,857   1,025   1,489    1,489   1,025  1,026     83    523     395
        # systems w/ violations    480     175     507      224     263    257      6    111      83
        % systems w/ violations    26%     17%     34%      15%     26%    25%     7%    21%     21%
        # Violations             1,289     193     964      235     877    722     12    163     635
        # Discrepancies          1,034     174     844      210     864    686     12    161     628
        % Discrepancies            80%     90%     88%      89%     99%    95%   100%    99%     99%
        SDWIS/FED data quality     20%     10%     12%      11%      1%     5%     0%     1%      1%

        * insufficient data

                       Legend:
                       MCL: Maximum Contaminant Level violation
                       TT: Treatment Technique violation
                       (MCLs and TTs are health-based violations)
                       M/R: Monitoring/Reporting violation
                       TCR: Total Coliform Rule
                       IOCs: Inorganic Chemicals
                       SOCs: Synthetic Organic Chemicals
                       VOCs: Volatile Organic Chemicals
                       TTHMs: Total Trihalomethanes
                       Rads: Radionuclides
                       SWTR: Surface Water Treatment Rule
SDWIS/FED data quality for TCR data is significantly higher than for other rules or
contaminants. The data quality estimates for SOCs, TTHMs, Rads, and SWTR averaged
1% quality, or less. The vast majority of these discrepancies were due to under-
reporting—specifically, no data found in state files. In other words, no more than 1 of
every 100 SOC, TTHM, Rad, and SWTR M/R violations is reported to SDWIS/FED.

4.4.5.4  Estimates by system type
Within each water system type, the estimates were similar enough across rules to be
combined into the three violation categories shown below. Combining them significantly
improved the precision of the estimates, since more data points yield better estimates.
                   TCR MCL   SWTR TT    M/R
        CWS           69%        9%      9%
        NTNCWS        67%       11%      7%
        TNCWS         68%        0%     14%
        Overall       68%        7%      9%
Again, there are not enough data points to calculate estimates for Completeness and
Accuracy, nor are there sufficient data to estimate quality of Other MCLs by system type.
Unfortunately, the results of the data verifications analysis cannot be sorted by system
size. The industry surveys can be categorized in this way, as will be discussed later.
4.4.6  Enforcement actions data
4.4.6.1  Estimates by system type, and overall
Estimates for formal SDWIS/FED enforcement actions data quality are preceded by a
systems perspective.
                                               CWS   NTNCWS   TNCWS    Total
        # Systems Reviewed                     696      548     562    1,806
        # Systems with Enforcement Actions     163      122      75      360
        % Systems with Enforcement Actions     23%      22%     13%      20%
        # Enforcement Actions                  505      305     222    1,032
        # Discrepancies—under-reporting         55       53      29      137
        # Discrepancies—over-reporting          37       17      24       78
        # Discrepancies—incorrect reporting     29       22      21       72
        Total discrepancies                    121       92      74      287
        % Discrepancies                        24%      30%     33%      28%
        % Completeness                         89%      83%     87%      87%
        % Accuracy                             85%      85%     77%      83%
        SDWIS/FED data quality                 76%      70%     67%      72%
Data in the "Total" column will serve as an example to describe this table:
•   System perspective—of the 1,806 systems reviewed in the audits, 360, or 20%, had
    enforcement actions.
•   Of the 1,032 enforcement actions listed for these 360 systems, there were 287
    discrepancies. The discrepancy rate is 287/1,032, or 28%.
•   Overall, 87% of the data that should be in SDWIS/FED make it in (Completeness),
    and 83% of the enforcement actions in SDWIS/FED are accurate.
The calculation for Completeness is based on the number of discrepancies that represent
under-reporting. Here the data verifications are clear as to which actions were not
reported to SDWIS/FED. The calculation for Accuracy is based on over-reporting
(missing from state files) as well as incorrect reporting (which occurs if the dates listed
in SDWIS/FED are off by more than a month).
The quality estimates are similar across system types: all were within 4% of the
combined average.
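A minimal sketch of the two calculations (Python; the function names are illustrative,
and the exact denominator EPA used for Accuracy is not stated in this report, so the
computed Accuracy differs slightly from the table):

    # Sketch of the Completeness and Accuracy calculations described above,
    # applied to the "Total" column of the enforcement actions table.
    def completeness(total_actions: int, under_reported: int) -> float:
        """Share of actions that should be in SDWIS/FED that made it in."""
        return 1 - under_reported / total_actions

    def accuracy(total_actions: int, over_reported: int, incorrect: int) -> float:
        """Share of listed actions that are neither over-reported nor incorrect."""
        return 1 - (over_reported + incorrect) / total_actions

    print(f"{completeness(1032, 137):.0%}")  # 87%, matching the table
    print(f"{accuracy(1032, 78, 72):.0%}")   # ~85%; the table lists 83%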

4.5  Industry surveys

4.5.1  Background
Both the National Rural Water Association (NRWA), in conjunction with the Association
of State Drinking Water Administrators (ASDWA), and the American Water Works
Association (AWWA) volunteered to survey their water systems.
The objective of this effort was to get data quality estimates from water systems directly.
Indeed, this is the only analysis that goes upstream of state records.  From this analysis
EPA derived overall SDWIS/FED data quality estimates for inventory data, and
Accuracy estimates for violations and enforcement actions data. Operators were not
asked to assess the Completeness of violations and enforcement actions data, but only the
Accuracy of those listed in SDWIS/FED. Another objective was to provide states with
feedback from this effort to help them investigate and correct potential errors. Any
corrections a state makes will be reflected in SDWIS/FED in the next quarterly update
after they are submitted.

4.5.2  Survey design
Water systems surveyed received a printout of their inventory, violations, and
enforcement action data from SDWIS/FED. Water system operators were asked to mark
each data point as correct or incorrect, or to indicate "DK" if they did not know. Each
data point marked "DK" was removed from the survey analysis so as not
to artificially lower the discrepancy rates.
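A minimal sketch of that tallying rule (Python; the response values are illustrative):

    # Sketch of the survey tallying rule above: "DK" responses are dropped
    # before the discrepancy rate is computed, so they neither raise nor
    # lower the rate.
    def discrepancy_rate(responses):
        """responses: iterable of 'correct', 'incorrect', or 'DK' marks."""
        scored = [r for r in responses if r != "DK"]
        if not scored:
            return None  # nothing evaluable
        return scored.count("incorrect") / len(scored)

    print(discrepancy_rate(["correct", "DK", "incorrect", "correct"]))  # 1/3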

AWWA sent surveys to  all water systems serving more than 10,000 people that incurred
at least one violation between FY1993 and FY1997. Of the 2,222 surveys sent, 684 were
completed and returned, resulting in a 31% response rate (25% is a typical response rate
for mailed surveys).
NRWA/ASDWA surveyed active, current systems serving fewer than 10,000 people that
incurred at least one violation between FY1993 and FY1997. A random sample of 40
CWSs and 5 NTNCWSs was selected for each state. Of 2,549 surveys sent, 439 were
completed and returned from 23 states. The response rate was  17% overall, and 39%
from the 23 states that participated.
As discussed below, in both surveys, water systems that did not respond had a higher
average number of violations than those that did. The effect of this self-selection bias on
the results of this analysis is unclear.
Two systems were removed from the AWWA survey. A system in New Jersey disputed
all of its 751 violations. This may be a case of over-reporting, but the inclusion of this
single system in the survey would have resulted in overall discrepancy rates four times
higher. A system in Pennsylvania with 718 violations was removed because it was not
clear how to categorize its violations. On its survey sheets, the water system indicated
"DK." In a letter sent with the completed survey, the system did not dispute any of the
violations, and in fact explained how several of them occurred. In a telephone interview,
it disputed all of them.

An EPA contractor conducted telephone interviews with 7 water system operators  to
evaluate how they filled out the survey, and to investigate potentially "extreme"
responses—water systems which disputed either all or none of their violations data.
The contractor found evidence of response bias in the violations and enforcement
actions responses: a number of water systems contacted said they left violations and
enforcement actions data points blank, rather than indicating "DK," if they were unsure
whether the data points were correct or not. The magnitude of this bias is unclear.
4.5.3  Inventory data
Operators did a thorough job in evaluating their inventory data. Since each water system
had 7 required data points to evaluate, the results from each water system are counted
equally. This is in contrast to violations and enforcement actions data, where a few
systems having hundreds of violations, for example, significantly increase the average
violations discrepancy rates. As a result, there is a fairly high degree of confidence in
these inventory estimates.
4.5.3.1   Estimates by required parameter, and overall
                            Status:        Water     Primary  Population  # Service
                            Active/Inact.  system    source   served      connections  Address  Name  Overall
                                           type
        AWWA survey             100%        100%       90%       85%          86%        87%    97%    92%
        NRWA survey              99%         98%       97%       84%          85%        87%    97%    93%
        Data verifications       97%         97%       98%       91%          92%        95%    98%    96%
Public water system identification number (PWS ID) was not assessed in the surveys. A
difference in Population served or # service connections is counted as a discrepancy only
if it is greater than 10%, as was done in the data verifications analysis.
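A sketch of that 10% rule (Python; the report does not specify which value serves as the
baseline for the percentage, so using the operator-reported value is an assumption here):

    # Sketch of the 10% threshold above for numeric inventory fields.
    def is_numeric_discrepancy(sdwis_value: float, reported_value: float) -> bool:
        """Population served or # service connections differing by more than 10%."""
        return abs(sdwis_value - reported_value) > 0.10 * reported_value

    print(is_numeric_discrepancy(1050, 1000))  # False: 5% difference
    print(is_numeric_discrepancy(1200, 1000))  # True: 20% difference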
As shown above, the overall SDWIS/FED data quality estimates are very close to but
slightly lower than the estimates from the data verifications analysis. The surveys
estimated slightly lower SDWIS/FED data quality for primary source (AWWA survey
only), population served, service connections, and address.
The surveys also asked water system operators to evaluate some optional data
parameters. This information was requested to estimate the quality of the currently
optional data that would become required as of January 2000.
4.5.3.2  Estimates by additional parameters
                        Primary           Owner                          Principal  Principal
                        Contact   Phone   category  County 1  County 2   city       county
        AWWA survey        93%     65%      96%       97%       100%       56%        95%
        NRWA survey        94%     70%      91%       91%       100%       76%        98%
4.5.3.3   Estimates by system type and size, for required data
        System type              CWS    NTNCWS
        AWWA survey              92%      n/a
        NRWA survey              92%      97%
        Data verifications       97%      96%

        Size category        Very Small   Small   Medium   Large   Very Large
        NRWA/AWWA surveys       93%        92%     91%      92%       91%
These estimates by system type are slightly lower than those estimated by the data
verifications for CWSs, and they are very close for NTNCWSs.
As discussed above, EPA was not able to calculate SDWIS/FED data quality estimates by
system size category in the data verifications analysis. Fortunately, EPA was able to do
this in the industry surveys. As shown above, the results across size categories are very
close and show high data quality for required inventory data selected for this analysis.
4.5.4  Violations data
As described above, the surveys yielded estimates of the Accuracy of the data in
SDWIS/FED; they did not assess the Completeness of the data (the % of data that should
be in SDWIS/FED that made it in). These Accuracy estimates are of uncertain value
because some water system operators may have left data points blank when they were
not sure whether a violation was correct (instead of indicating that they did not know).
This dilutes the discrepancy rates to an unknown degree.
In addition, there may be some non-response bias: water systems included in the survey
that did not respond averaged 40% more violations in the AWWA survey  and 58% more
violations in the NRWA survey than those that did. The effect of this bias is unclear.
Ninety-six percent (96%) of systems in the AWWA survey, and 91% in the NRWA
survey, did not dispute any of their violations. Overall, these Accuracy estimates are very
close to those from the data  verifications analysis.
4.5.4.1  Accuracy estimates by violation type
                             Total MCL   SWTR TT   Total M/R
        NRWA survey             96%         91%       97%
        AWWA survey             99%         99%       96%
        Data verifications      97%         67%       95%
The Accuracy estimates for Total MCLs and Total M/Rs are very similar to those from
the data verifications analysis. However, the surveys estimated a higher Accuracy of
SWTR TTs than did the data verifications.

4.5.4.2  Accuracy estimates by rule/contaminant
        MCL violations (TCR through Rads) and TT violations (SWTR, LCR)

                        TCR     IOCs   Nitrate  Nitrite  SOCs   VOCs   TTHMs   Rads   SWTR    LCR
        NRWA survey    97.0%     n/a    89.9%      *       *    100%      *    100%   90.6%  100%
        AWWA survey    99.4%    100%     100%      *     100%   100%   97.0%   100%   99.4%  96.8%
                        * insufficient data

        M/R violations

                        TCR     IOCs   Nitrate  Nitrite  SOCs   VOCs   TTHMs   Rads   SWTR    LCR
        NRWA survey    93.6%    100%    94.2%    100%    100%   99.7%     *    100%   100%   88.1%
        AWWA survey    91.7%   99.2%    96.8%    100%    76.7%  98.4%  94.0%   100%   69.3%  100%
                        * insufficient data
Overall, the surveys estimate very high Accuracy of these contaminants/rules. In other
words, water system operators disputed very few of the violations listed in SDWIS/FED.

Unfortunately, EPA was not able to calculate comparable Accuracy estimates for
violations data from the data verifications analysis, because too few data points were
available at the rule/contaminant level. Therefore, a direct comparison of the results
listed above to the data verifications analysis is not possible.
4.5.4.3  Accuracy estimates by system type and size
                             System type   Accuracy, all violations
        AWWA survey              CWS                95%
        NRWA survey              CWS                97%
        NRWA survey             NTNCWS              95%

        Size category        Very Small   Small   Medium   Large   Very Large
        NRWA/AWWA surveys       97%        96%     95%      96%       90%

Overall Accuracy estimates are very close across system types. By system size they are
very close as well, with the exception that Very Large systems are 5-7 percentage points
lower.

4.5.5  Enforcement actions data
Again, these Accuracy estimates are of uncertain value because some water system
operators may have left data points blank instead of indicating "DK" on their surveys.
This dilutes the discrepancy rates to an unknown degree.
4.5.5.1  Accuracy estimates by system type and size
                                  System type   Accuracy
        AWWA survey                   CWS          98%
        NRWA survey                   CWS         99.6%
        Data verifications            CWS          89%
        NRWA survey                  NTNCWS        99%
        Data verifications           NTNCWS        83%

        Size category        Very Small   Small   Medium   Large   Very Large
        NRWA/AWWA surveys       99%       99.8%    99%      99%       98%
The Accuracy estimates are higher than those estimated in the data verifications analysis.
In addition, the survey findings indicate that the Accuracy is very similar across system
types and sizes.

4.6  Frozen database comparison—Timeliness estimates
A violation is due to be reported to SDWIS/FED within 90 days after its compliance
period end date. This analysis quantifies how long it has taken for FY1997 violations to
be reported.
SDWIS/FED databases have been "frozen" quarterly since 1997. These frozen databases
enable EPA to look at what data were in SDWIS/FED during set time periods. This
analysis compares fiscal year 1997 data reported in each of the seven quarterly databases
frozen since January 1998.

The following estimates are based on violations reported to SDWIS/FED by July 1999.
This analysis assumes that all FY1997 violations which were going to be reported
actually were reported by July 1999 (7 quarters after all violations were due). Data for
North Carolina are not included in these estimates; its reporting of violations data was
highly erratic and would have skewed the results.

There were 137,978 violations in FY1997 with end dates at or before September 30,
1997, which were due to be reported by December 31, 1997. Similarly, there were an
additional 36,937 violations with end dates between October 1 and December 31, 1997,
and these were due to be reported by March 31, 1998. Therefore, a total of 174,915
FY1997 violations were due to be reported not later than March 31, 1998.

There were also 12,849 violations having end  dates later than December 31,  1997; they
are not included in this analysis. Most had significantly later end dates and would not be
due to be reported before July 1999.
        Timeliness            97Q4      98Q1      98Q2      98Q3      98Q4      99Q1      99Q2
                             frozen    frozen    frozen    frozen    frozen    frozen    frozen
                             Jan '98   Apr '98   Jul '98   Oct '98   Jan '99   Apr '99   Jul '99
        # violations
        reported              94,484   118,318   153,988   158,752   170,793   170,647   174,915
        # that should have
        been reported        137,978   174,915   174,915   174,915   174,915   174,915   174,915
        % reported by each
        frozen database          68%       68%       88%       91%       98%       98%      100%
        [Figure: # violations due to be reported (137,978 as of 97Q4, then 174,915) and
        # violations reported, for each database frozen from 97Q4 through 99Q2.]
At the national level, this analysis indicates that 68% of FY1997 violations that should be
in SDWIS/FED by December 31, 1997 made it in on time, and that 68% of all the
violations that should have been reported by March 31, 1998 actually were reported by
then. Late reporting is a component of Completeness. As can be seen above, late
reporting is a significant problem.
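The Timeliness percentages above reduce to a simple ratio for each frozen database, as
in this sketch (Python; figures are taken from the table above):

    # Sketch of the Timeliness calculation: for each quarterly frozen
    # database, the share of FY1997 violations due by that point that
    # had actually been reported.
    frozen = [
        ("97Q4, frozen Jan '98",  94_484, 137_978),
        ("98Q1, frozen Apr '98", 118_318, 174_915),
        ("98Q2, frozen Jul '98", 153_988, 174_915),
        ("98Q3, frozen Oct '98", 158_752, 174_915),
        ("98Q4, frozen Jan '99", 170_793, 174_915),
        ("99Q1, frozen Apr '99", 170_647, 174_915),
        ("99Q2, frozen Jul '99", 174_915, 174_915),
    ]
    for label, reported, due in frozen:
        print(f"{label}: {reported / due:.0%} reported")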
It was not possible to factor Timeliness estimates into the SDWIS/FED data quality
estimates since the period of review for most contaminants/rules in the data verifications
audits was primarily 1993-1998—most of which occurred before late 1997 when EPA
began to "freeze" SDWIS/FED databases.

EPA was able to categorize Timeliness using the two methods of data entry to
SDWIS/FED. EPA used data from the errors analysis to determine which state used
which data entry method.
•  Traditional method, wherein only new, modified, or deleted information is
   transmitted

•  Total Replace method, wherein the state sends a complete data set every quarter and
   totally over-writes all data previously submitted.

Violations appear to be reported in a more timely manner when the Traditional method is
used to report violations,  compared to the Total Replace method:
        % reported by each    97Q4      98Q1      98Q2      98Q3      98Q4      99Q1      99Q2
        frozen database      frozen    frozen    frozen    frozen    frozen    frozen    frozen
                             Jan '98   Apr '98   Jul '98   Oct '98   Jan '99   Apr '99   Jul '99
        Total Replace method     63%       60%       75%       75%       92%       93%      100%
        Traditional method       71%       71%       94%       98%      100%      100%      100%
        [Figure: % of FY1997 violations reported by each frozen database, Traditional
        method vs. Total Replace method, 97Q4 through 99Q2.]
As can be seen, many states have been adding and modifying FY1997 violations data
several quarters after they were due. Through 1996, there does not appear to have been
nearly as much volatility in the data. This may be the result of attention focused on
correcting discrepancies between State Annual Compliance Reports and SDWIS/FED
which were first identified in the 1996 and 1997 reports. An almost identical trend is
occurring with FY1998 data, as illustrated below:
        [Figure: # violations listed in each database frozen since Jan '98 for FY1997
        data, and since Jan '99 for FY1998 data. North Carolina data not included; its
        FY1997 reporting was highly erratic and would have skewed the results.]
4.7   Comparison of SDWIS/FED to Envirofacts
The public sees SDWIS/FED data as displayed in Envirofacts, EPA's multimedia
website. One aspect of this analysis was to compare data in SDWIS/FED to Envirofacts
to ensure that no errors are introduced in transfer of data from SDWIS/FED to
Envirofacts. All data from 250 water systems selected at random were compared in the
two databases to identify any data transfer errors. No errors were found.
5  Additional data quality analyses

5.1  States' reporting of violations data
As part of a further analysis of under-reporting identified initially in the data verifications
analysis, EPA looked at the Annual Compliance Report (ACR) comparison to
SDWIS/FED. The ACR vs. SDWIS/FED analysis indicated that several states did not
report any violations at all in CY1997 for certain contaminants/rules. Some of the non-
reporting was attributable to late reporting. Some states that did not report in 1997 had
reported in other years.

To factor out late reporting, and to  get a more comprehensive picture of non-reporting of
certain violations by state, EPA queried the SDWIS/FED database frozen in October
1999 and listed all violations reported by each state between FY1993 and FY1998. It
found that over a dozen states have never reported chemical rule violations for any
NTNCWSs or TNCWSs, and half have never reported Radiological rule violations for
CWSs.
Clearly, some of the non-reporting  is attributable to states simply not having any
violations to report. However, in light of the magnitude of under-reporting estimated in
the data verifications analysis, and given the percentages of systems estimated to have
violations, by rule, many of these "blanks" represent a problem. These "blanks" are being
evaluated in state-by-state summaries of SDWIS/FED data quality.
The two tables below only include situations where the state has systems subject to a
rule. One state has no NTNCWSs (Alaska), and the SWTR has no impact in states
without surface water systems in a given system type category: one state/territory has no
surface water CWSs, 7 have no surface water NTNCWSs, and 13 have no surface water
TNCWSs.

5.1.1   Number of the 52  states/territories that have never reported any violations,
       by rule,  between FY1993 and FY1998
Below are the counts of states/territories that have never reported a violation in this
six-year period, by rule and system type.


                      TCR          Chemicals        RADs            LCR            SWTR
                   MCL    M/R     MCL    M/R      MCL    M/R      TT    M/R      TT    M/R
        CWS          0      1       5      8       25     23      21      1       4     18
        NTNCWS       0      0      13     14      n/a    n/a      25      4      16     27
        TNCWS        1      1      12     14      n/a    n/a     n/a    n/a      11     20

        n/a: the RADs rule applies to CWSs only, and the LCR does not apply to TNCWSs.

This can also be shown in percentages, which will facilitate a comparison with the next
table.

                      TCR          Chemicals        RADs            LCR            SWTR
                   MCL    M/R     MCL    M/R      MCL    M/R      TT    M/R      TT    M/R
        CWS         0%     2%     10%    15%      48%    44%     40%     2%      8%    35%
        NTNCWS      0%     0%     25%    27%      n/a    n/a     49%     8%     36%    60%
        TNCWS       2%     2%     23%    27%      n/a    n/a     n/a    n/a     26%    48%
5.1.2  Percent non-reporting of violations, by type, between FY1996 and FY1998

It is also informative to look at the percentage of non-reporting, that is, the percentage of
"blanks" in each year. The table below lists the percent of non-reporting that occurred
between FY1996 and FY1998 by contaminant/rule. There are 156 opportunities to report
violations in each box below (52 states × 3 years), less the number of states not counted,
as described above.
                      TCR          Chemicals        RADs            LCR            SWTR
                   MCL    M/R     MCL    M/R      MCL    M/R      TT    M/R      TT    M/R
        CWS         0%     3%     22%    28%      62%    56%     55%    28%     17%    63%
        NTNCWS      2%     5%     46%    38%      n/a    n/a     70%    39%     55%    81%
        TNCWS       5%     6%     40%    36%      n/a    n/a     n/a    n/a     53%    72%

        n/a: the RADs rule applies to CWSs only, and the LCR does not apply to TNCWSs.
This shows that almost all states have reported both TCR MCL and M/R violations in
each year. The other rules have had significantly less reporting. For each rule and
violation type, the most reporting has been done for CWSs. NTNCWSs and TNCWSs
fared about the same as each other, but were reported less frequently than CWSs.

Comparing this table to the one above it shows that the percentage of "blanks" is in some
cases significantly higher than when merely considering states that have never reported.
In other words, the states that have reported for specific rules/contaminants have not done
so in each year. For example, 10% of states have never reported a Chem MCL (which
accounts for  10% of the "blanks"), but there are 22% "blanks" for Chem MCLs.

5.1.3  Percent non-reporting of violations, by year

The level of non-reporting in each year has been  fairly steady, although it increased in
1998. EPA calculated the statistics below by dividing the total number of "blanks" each
year by the total number of opportunities to report violations.
        1998    64%
        1997    59%
        1996    58%
        1995    58%
        1994    53%
        1993    58%
Again, some of these blanks represent states that simply had no violations in a category
during a year. State-by-state summaries of SDWIS/FED data quality take a closer look at
this issue.

5.2   Comparison of states' reporting of Annual Compliance Report (ACR)
      data to SDWIS/FED data

5.2.1   Background
This analysis highlighted differences between data in state databases and files, and data
in SDWIS/FED. These differences were analyzed numerically using the 1997 ACR data.
States were also asked to identify reasons for discrepancies between what they reported
for the 1996  and 1997 ACR and what is in SDWIS/FED.
5.2.2  Under- and over-reporting between state databases and SDWIS/FED
In this exercise, EPA calculated ratios of under- to over-reporting. These results were
also used in the data verifications analysis.
The data verifications list discrepancies between state databases and SDWIS/FED, but do
not divide them into over-reporting, under-reporting, and incorrect reporting (in the case
of incorrect reporting, the violation exists in both databases but does not match). In order
to calculate estimates for Completeness and Accuracy, EPA had to ascribe discrepancies to
either over-reporting or under-reporting (it is not possible to get numerical estimates of
incorrect reporting). This will also enable EPA to compare accuracy estimates from the
data verifications analysis to the industry surveys.
The ACR vs. SDWIS/FED analysis used 1997 ACR data. The ratios of under-reporting to
over-reporting, by rule and overall, are shown below:
                   TCR           Chem           SWTR           LCR            Total             Overall
                 MCL   M/R     MCL    M/R      TT    M/R      TT    M/R     MCL    TT    M/R     Total
                21.0   3.0    10.5  136.5     3.3   37.3    14.6   19.0    16.3   4.5   16.2      15.5

Significantly more under-reporting than over-reporting of violations was found. For
example, of the 1997 ACR violations reported using state databases vs. SDWIS/FED, the
magnitude of overall under-reporting was more than 15 times as great as the magnitude
of over-reporting.
Here is how these estimates were calculated:
First, EPA excluded states that reported using SDWIS/FED, since EPA wants to compare
what is in state databases to what is in SDWIS/FED. EPA also excluded  Chemical M/R
violations for one state that listed 21,807 violations in their state database and only 98 in
SDWIS/FED; these numbers were an anomaly, and they skewed the overall results.
Next, instances of over-reporting and under-reporting were summed separately. For each,
differences were taken (between the totals for violations in state databases and in
SDWIS/FED).
Finally, the difference, or number of discrepancies, for under-reporting was divided by
the difference for over-reporting.
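A sketch of that three-step calculation (Python; the pairing of counts is an illustrative
data layout, not the actual file format):

    # Sketch of the under- to over-reporting ratio calculation above.
    def under_over_ratio(pairs):
        """pairs: (violations in state database, violations in SDWIS/FED),
        one pair per state, after the exclusions described above."""
        under = sum(s - f for s, f in pairs if s > f)  # state shows more
        over = sum(f - s for s, f in pairs if f > s)   # SDWIS/FED shows more
        return under / over

    # Made-up counts: under = (10-4) + (9-6) = 9, over = 7-5 = 2, ratio 4.5.
    print(under_over_ratio([(10, 4), (9, 6), (5, 7)]))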

5.2.3  Minimum discrepancy rates between state databases and SDWIS/FED
Along with the ratios calculated above, it is informative to look at the discrepancy rates
between 1997 ACR data in state files and SDWIS/FED. These estimates are listed below:
                   TCR           Chem           SWTR           LCR            Total             Overall
                 MCL   M/R     MCL    M/R      TT    M/R      TT    M/R     MCL    TT    M/R     Total
                 15%   31%     40%    41%     20%   38%      86%   68%     18%   26%   39%       37%
These discrepancy rate estimates are minimum estimates: to generate them, EPA had to
assume that as many violations as possible match between state databases and
SDWIS/FED. For example, if there are 6 violations in a state's database and 10 in
SDWIS/FED, it is assumed that those 6 match, resulting in 4 instances of over-reporting.
The discrepancy rates generated from this analysis are also understated because the
maximum value is used in the denominator so that discrepancy rates do not exceed
100%.

Another way of looking at these results is to see how well the data match between state
databases and SDWIS/FED. For example, TCR MCL data have an estimated minimum
discrepancy rate of 15%; this means that a maximum of 85% of the data match.
Maximum correlation estimates are shown below:
                   TCR           Chem           SWTR           LCR            Total             Overall
                 MCL   M/R     MCL    M/R      TT    M/R      TT    M/R     MCL    TT    M/R     Total
                 85%   69%     60%    59%     80%   62%      14%   32%     82%   74%   61%       63%
Overall, roughly 2/3 of the data in state databases and SDWIS/FED match. LCR TTs had
the lowest correlation estimate of 14%.

Here is how these estimates were calculated:
Again, EPA used 1997 data, only included states that reported using their own databases,
and excluded Chem M/R violations from the state with huge underreporting.
EPA separated the minimum and maximum value of each pair of data (a pair being, for
example, the number of TCR MCL violations in a state's database and the corresponding
number in SDWIS/FED). The maximum and minimum values were summed separately,
and the totals were put into the following equation:

        Minimum discrepancy rate = (sum of maximum # violations − sum of minimum # violations)
                                   / (sum of maximum # violations)

EPA divided by the sum of the maximum # violations to keep the discrepancy rates
below 100%.
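The same formula as a short sketch (Python; the pairs are made up):

    # Sketch of the minimum discrepancy rate formula above.
    def min_discrepancy_rate(pairs):
        """pairs: (violations in state database, violations in SDWIS/FED)."""
        max_sum = sum(max(s, f) for s, f in pairs)
        min_sum = sum(min(s, f) for s, f in pairs)
        return (max_sum - min_sum) / max_sum

    # Pairs (6, 10) and (8, 8) give (18 - 14) / 18, about 22%.
    print(f"{min_discrepancy_rate([(6, 10), (8, 8)]):.0%}")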

5.2.4  Main reasons cited by states for these discrepancies
Overall, the category of "data entry problems" was the most common reason given for
discrepancies. This includes incomplete PWS inventories; and data submission, transfer
file format, and coding problems. The category of "resource limitations" was the next
most common reason for discrepancies. This includes the inability of a state system to
upload data to SDWIS/FED, lack of staff and/or programmers, and no automated tracking
system for a particular rule.
The following table lists the number of states that cited each reason for discrepancies, by
violation type:

                                                  MCL                 M/R                 TT
        Reason                               Under-   Over-      Under-   Over-      Under-   Over-
                                            reporting reporting reporting reporting reporting reporting
        Data Entry                             14        8         23        9          9
        Resource Limitations                    7                   7                  11
        Regulation Interpretation Issues        3                   7
        ACR Guidance Interpretation Issues      5        2          6        5
        Late Reporting                          1                   4        1          2
        Automated System Generation                                 1
        Reason not provided                    20       10         31       18         11
For under-reporting, M/R violations had the most discrepancies. The most frequently
cited reason is data entry. This was followed by TT violations, with the most frequently
cited reason being resource limitations, followed by data entry. MCL violations had the
fewest discrepancies. The most frequently cited reason is data entry.

Most frequently cited reasons for discrepancies, by rule:

                            Data     Resource      ACR        Regulation
                            entry    limitations   guidance   implementation
        Chems    MCL         #1         #2
                 M/R         #1         #2
        TCR      MCL         #1         #2
                 M/R         #1         #2
        SWTR     TT          #1         #2
                 M/R         #1         #2
        LCR      TT                                  #2            #1
                 M/R                                               #1
5.3   Error reports analysis—data transfer errors

5.3.1  Background
This analysis reviewed 2 quarters of error production reports (received during the period
August 1 through December 31, 1998), to look at the magnitude of, and reasons for, data
transfer errors between state databases and SDWIS/FED. Eight hundred forty-one (841)
files were reviewed. Three hundred two (302) files were analyzed in detail to determine
the error rejection rate and the error correction rate.
At the state level, the information obtained will be used to provide recommendations for
corrective actions, training needs identification, and quality assurance procedures.
Because the method of update, the approach to error correction, and the level of effort
states expend on correcting errors vary from quarter to quarter, EPA determined that
extrapolating the errors analysis information to a national-level rejection rate for each
type of submission and/or data type would be inconclusive. Additional meta-data will
need to be collected in the future if more detail is desired on rejection rates.

5.3.2  Common types of errors
Of the over 800 possible error conditions which are programmed into SDWIS/FED edit
criteria, only 230 occurred in the 841 files analyzed. The most common reasons are listed
below:
        27%   Invalid values: typos, non-permitted values, etc.
        14%   Cross Edits: data rejected because a comparison between two or more
              attributes yielded incompatible values.
         8%   Non-Existent Data: attempts to modify or delete data or records which do
              not exist on the database.
         8%   Processing Rule: comparison between two or more attributes showed invalid
              combinations.
         8%   Missing Registration Requirements: attempts to post a new water system
              without all required elements present.
         7%   Content: missing values and/or missing combinations of data.
         7%   SDWIS/FED bugs and software limitations.
         6%   Duplicate Data: data already exists in the input file or in the database.
Eighty-two percent (82%) of the error "types" relate to data entry errors (e.g., failure to
follow data entry instructions, keypunch, missing or incomplete data, or invalid values).
SDWIS/FED bugs and software limitations represent 7% or less of the errors. The
remaining 11% included informational messages, old FRDS conversion errors, and errors
that could be either a SDWIS/FED bug or a state data entry error depending on the data
submitted.

5.3.3  Main reasons cited for non-reporting during the analysis period
State resource limitation was given as the primary reason for Lead & Copper sample data
not being reported during the analysis period. Three states were unable to submit action
files due to major system software reprogramming or data clean-up activities. Those
failing to submit any inventory data during 1998 cited major system software conversion
activities or state resource limitations as the reason.

5.3.4  Rejection rates of files
Rejection rates for inventory and actions data were calculated for files submitted using
the Traditional method. It was not possible to calculate comparable rejection rates using
the Total Replace method because SDWIS/FED cannot identify which data in the file are
being submitted for the first time. The following equation was used;
       The error rates of files using the Traditional method = # lines in error file
     # lines in input file
In Traditional updates, 20% of inventory data and 32% of violation and enforcement
actions data are being rejected.
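A sketch of that rate (Python; the file paths are hypothetical):

    # Sketch of the Traditional-method rejection rate above: lines in the
    # error report divided by lines in the submitted input file.
    def rejection_rate(input_path: str, error_path: str) -> float:
        with open(input_path) as f:
            input_lines = sum(1 for _ in f)
        with open(error_path) as f:
            error_lines = sum(1 for _ in f)
        return error_lines / input_lines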

5.3.5  States' success in submitting correction files
Most states attempted corrective actions. When data are rejected from SDWIS/FED,
states (or EPA regions acting on the state's behalf) are  sent error reports indicating what
data were rejected and the reason(s) for the rejection. It appears that error files having a
large number and/or variety of errors were not being corrected on the first attempt. Some
errors did not require correction, such as duplicate records being submitted, or intentional
manipulation by the state or EPA region in order to achieve a specific result. It was not
possible to accurately determine the volume of such errors. Only a quarter of all states
were completely successful in resubmitting rejected data on their first attempt. Three-
fourths of the states had at least a quarter of their resubmitted data rejected on the second
attempt. Reasons for errors
remaining uncorrected include: states did not understand how to correct the original error,
they chose to correct only some errors, or, as mentioned above, some errors do not
require correction.

5.4   State structures analysis
Analysis of the ASDWA Management and Data Flow of States survey failed to produce
any clear-cut reasons for particular state drinking water programs to have better data
quality, defined as consistency between state records and SDWIS/FED. To perform the
analysis, state ranking was determined by dividing the total number of discrepancies for
violations between state records and SDWIS/FED by the total number of violations, and
obtaining a percentage. Then staff looked to see if there was a correlation between the
way a state is organized and its data quality, as measured by its discrepancy rate.
Analysis showed that both the highest and lowest ranking states had similar responses to
the survey questions.
Because the analysis of the ASDWA data provided no clear-cut answers, EPA asked the
Cadmus Group, which has conducted data verifications in the past, to select several states
that it believes maintain model programs and to summarize the organizational structure
of those states. These states were selected regardless of any violation or discrepancy
numbers present in DV reports. The number of levels in an organization does not appear
to have as great an impact on data quality as the quality of communications does.
Another key factor was an adequate number of trained, qualified personnel. The program
and management structural components believed to be critical to promoting high data
quality are presented below:

•   Communication: Routine, meaningful and timely communication at all levels.

•   Annual PWS notification of monitoring schedules and requirements

•   Automated compliance determination for monitoring requirements

•   Violation notification with required corrective action instructions

•   Standard operating procedures and related periodic training including: data entry,
    forms completion, conducting sanitary surveys, and compliance determination
•   Efficient and timely method of access to water system data for all staff

•   Electronic access to laboratory  sample data

•   Existence and use of a quality assurance program which resolves and prevents errors

•   Standardized data submission format (electronic or forms) from PWS and labs

•   Streamlined handling of documents and analytical results through compliance
    determination and the recording of violations and follow-up actions

5.5  State summaries of SDWIS/FED data quality, and recommended
     improvements
The last component of this project is the EPA analysis of SDWIS/FED data quality on a
state-by-state basis.  The resulting state summary reports will provide specific prioritized
recommendations  to help states improve their data quality. The  individual summaries will
be provided to  states separately during the spring of 2000. The summaries will address
the following state-specific findings.

•   ACR vs. SDWIS/FED analysis findings, which highlight areas of zero reporting and
    non-reporting, and violation type discrepancies greater than 10%.

•   Numeric and non-numeric findings from data verifications conducted between 1996
    and 1998. Data verifications conducted during 1999 are used to clarify or support
    findings from other analysis areas. Strengths  and areas of weakness, which impact
    data quality, are highlighted.
•  The number of violations reported by each state in each fiscal year between 1993 and
   1998, from the frozen database analysis. Violations are categorized by
   contaminant/rule and by water system type.
•  A discussion of state management structures with recommendations for improvement
   including a listing of key components that promote good SDWIS/FED data quality.
•  Significant findings from EPA Mid-Year and/or End-of-year Program Reviews
   relating to data management and SDWIS/FED data quality.
•  An analysis of SDWIS/FED error reports from data submitted during August 1,  1998
   through December 31, 1998.
       Appendix A—Stakeholders Working Group recommendations
Recommendations were identified and evaluated during the three major phases of the
Data Reliability Action Plan. The first phase involved the 3 public stakeholder meetings.
The second phase involved the individual analyses that were conducted, the results of
which are included in this report. The third phase was the Stakeholder Work Group
review of the preliminary findings of the data verifications, error report, ACR, and
timeliness analyses at the September 1999 meeting, where additional recommendations
were suggested. All recommendations were discussed and voted on at the meeting. The
following table presents the results of that vote.
# votes  Recommendation

19       Increase training
         • Provide on-site assistance to resolve state-specific data entry problems.
         • Provide additional compliance determination training, and data entry
           training for new and existing rules.
         • Establish a multi-regional cadre of trainers (funded through either a central
           contract and/or with the states paying for travel).

17       Improve the data verifications audits
         • Include specific, prioritized, implementable recommendations.
         • Include the # of systems with discrepancies.
         • Conduct DVs for each state every 2-3 years, which will help promote and
           track follow-up to previous DV recommendations.
         • Issue DV procedures so states can perform self-audits.
         • Review data at the water system level to correlate data in state files.
         • Add a timeliness review.
         • Make follow-up of DVs part of regional quarterly/annual reviews.
         • Tighten follow-up procedures: have the EPA regional office check back
           with states within 6 months.

15       Streamline reporting and rule complexity

15       Make error reports more user-friendly. It is currently very difficult for
         managers to use them to identify specific problems.

11       Encourage states to notify utilities annually of compliance monitoring
         schedules

10       EPA should focus follow-up on poorer state/regional performers
         • Focus on states not reporting specific rules; this should trigger a focused
           DV audit.

7        Require electronic reporting of monitoring regulations in the future

7        Require states to issue notices to utilities for each violation

6        Require labs to report sample results directly to states electronically

6        Improve front end retrieval of SDWIS/FED data

6        EPA HQ should provide contract funds for data management technical
         assistance

5        Provide new resources for data management

5        Enable utilities to review their data before it is sent to SDWIS/FED
         • Encourage state web access.
         • Ask trade associations to communicate the need for states to have
           additional resources to enable web access.

5        Establish a multi-state cadre of state peer reviewers
         • States provide travel funds.
         • Voluntary basis.

4        Focus national program guidance on M/R discrepancies
         • Help mitigate funds drawn to other media.

3        Develop automated compliance determination mechanisms in SDWIS/STATE

3        Centralize Oracle DBA support (this recommendation applies to all states,
         not only those using SDWIS/STATE)

3        Establish contract funds to help states enter data on an as-needed basis

2        Provide better guidance, including data flow diagrams, when new rules are
         issued

1        Have EPA over-file for states which choose not to report

1        Complete the edit summary report to identify generic errors

0        Standardize data transfer mechanisms