EPA 2020 COVID-19 Water Sector Survey

Appendix A: Survey Methodology

A.1. Summary

In 2020, the U.S. Environmental Protection Agency (EPA) surveyed the water sector to learn about the effects of the COVID-19 pandemic on supply chain, workforce, finance, sample collection and analysis, and cybersecurity issues. It would have been prohibitively expensive to collect this information from every water and wastewater utility; therefore, the Agency selected a random sample to participate in the survey. Planning and design of the survey began in June of 2020. The sample and questionnaire were developed from July through September of 2020. Data collection took place from October through December 2020. Data processing and analysis continued through February 2021.

The information was collected through a self-administered electronic questionnaire. The responses were compiled in an electronic database, and weighted estimates were developed to estimate the effect of COVID-19 on the sector as a whole and on subpopulations of utilities. The estimates include 95 percent confidence intervals to characterize the uncertainty introduced by sampling. The estimates and confidence intervals take into account the sampling design. This appendix describes in detail the approach used to conduct the survey and to analyze the results. The key features of the approach are:

• The estimates are based on data provided by a stratified random sample of water utilities. The approximately 61,000 utilities in the nation consist of approximately 48,000 community water systems (CWSs) providing drinking water, 12,000 wastewater treatment facilities (WWTFs), 300 American Indian (AI) utilities, and 180 Alaska Native Village (ANV) utilities. AI and ANV utilities may provide drinking water services only, wastewater services only, or both drinking water and wastewater services. The approximately 95,000 non-community water systems were not included in the survey. For the purposes of this report, all four groups (CWSs, WWTFs, AI utilities, and ANV utilities) are collectively referred to as "utilities." Utilities were further divided into size categories based on the number of people they serve, for a total of 17 mutually exclusive and exhaustive strata. The number of utilities in the inventory, the size of the sample, the number of respondents, and the response rate are shown by stratum in Table A.1.

• The survey was designed to estimate the proportion of utilities facing issues related to the COVID-19 pandemic in each stratum with a 95 percent confidence interval of plus or minus 5 percentage points. EPA set the targets by stratum for CWSs and WWTFs. For AI and ANV utilities, EPA set the precision target for each group of utilities as a whole, not by population size category. EPA oversampled to account for expected non-response.

• An EPA workgroup of staff from the Office of Ground Water and Drinking Water and the Office of Wastewater Management developed the questionnaire. The workgroup developed and refined a list of questions for each section, with attention to formatting and phrasing the questions to ensure they were clear and would collect the most useful information.

• Utilities were sent an email with a link to the electronic questionnaire, which they filled out and submitted. The questionnaire was coded and distributed to the sampled utilities using Qualtrics software, which also was used to track their responses and progress. Approximately 30 percent of the sample responded to the survey.
• The sample was used to characterize the effect of COVID-19 on the water sector. EPA developed sample weights that reflect the selection probabilities of each utility and the response rates by stratum. EPA also evaluated the potential error introduced by survey non-response by calling a subset of the utilities that did not respond to the survey and determined that the potential bias is small.

• EPA implemented a thorough quality assurance process for the survey. The process included automated validation checks; strict version control of the questionnaires, databases, programs, and reports; and procedures to back up and secure the data.

Table A.1. Inventory and Sample of Water and Wastewater Utilities for the 2020 COVID-19 Water Sector Survey

Sampling Stratum (Type of Service and Size of Population Served) | Inventory of Utilities (Sampling Frame) | Sample Selected | Survey Respondents | Response Rate
CWS, Less than 501 | 25,902 | 948 | 206 | 21.7%
CWS, 501-3,300 | 12,623 | 933 | 249 | 26.7%
CWS, 3,301-10,000 | 8,568 | 920 | 304 | 33.0%
CWS, 10,001-100,000 | 670 | 567 | 199 | 35.1%
CWS, Greater than 100,000 | 707 | 579 | 211 | 36.4%
Subtotal, CWS | 48,470 | 3,947 | 1,169 | 29.6%
WWTF, Less than 10,000 | 9,684 | 925 | 258 | 27.7%
WWTF, 10,000-99,999 | 2,015 | 808 | 297 | 36.8%
WWTF, 100,000 or more | 354 | 319 | 114 | 35.7%
WWTF, Size Unknown | 22 | 22 | 4 | 18.2%
Subtotal, WWTF | 12,075 | 2,074 | 673 | 32.4%
AI, Less than 101 | 100 | 82 | 17 | 20.7%
AI, 101-500 | 87 | 75 | 16 | 21.3%
AI, 501-3,300 | 98 | 78 | 19 | 24.4%
AI, 3,301-10,000 | 36 | 36 | 16 | 44.4%
AI, Greater than 10,000 | 10 | 10 | 3 | 30.0%
Subtotal, AI Utilities | 331 | 281 | 71 | 25.3%
ANV, Less than 101 | 61 | 61 | 13 | 21.3%
ANV, 101-500 | 90 | 90 | 22 | 24.4%
ANV, 501-3,300 | 28 | 28 | 8 | 28.6%
Subtotal, ANV Utilities | 179 | 179 | 43 | 24.0%
Total | 61,055 | 6,481 | 1,956 | 30.2%

Additional detail on the methodology for the survey is provided in this appendix. The approach is described in the following sections:

A.2. Study Background and Timeline: provides an overview and background of the survey approach
A.3. Sampling Design and Weighting: explains how the sample was selected and weights were developed
A.4. Survey Design and Response: describes how the questionnaire was developed, how the survey was administered, and the survey response
A.5. Quality Assurance: describes the survey's quality assurance procedures

A.2. Study Background and Timeline

EPA developed the survey to collect information about COVID-19-related needs in the water sector. The survey was designed to collect information about potential supply chain, workforce, financial, analytical support, and cybersecurity challenges from CWSs, WWTFs, AI utilities, and ANV utilities. In this report, those four groups are referred to collectively as "utilities." The data were collected through an on-line survey form using Qualtrics software.

Planning and design of the survey began in June of 2020. An Emergency Information Collection Request (Emergency ICR) was approved by the Office of Management and Budget on June 6, 2020. The questionnaire was developed by an EPA workgroup with representatives from the Water Security Division and Drinking Water Protection Division of the Office of Ground Water and Drinking Water and representatives from the Office of Wastewater Management. The draft questionnaire was reviewed by contractor water sector experts and representatives of water sector associations and revised based on their input. In addition, a pre-test of the questionnaire was conducted in September of 2020, and a pilot test of the questionnaire and data collection approach was conducted in October of 2020.
Full-scale data collection from a stratified random sample of utilities took place from October through December 2020. Data processing and analysis continued through February 2021.

A.3. Sampling Design and Weighting

A.3.1. Sample Design

The survey relied on a probability sample of utilities. The survey used a stratified random sampling design to ensure the sample was representative of the range of utilities in the country (including all 50 states, the District of Columbia, Puerto Rico, and the territories). The first step in the process was to create an inventory of utilities from which to draw the sample, also known as the sampling frame. The utilities in the frame were divided into seventeen mutually exclusive and exhaustive strata based on the utility type (CWS, WWTF, AI utility, or ANV utility) and the size of the population served by the utility. A sampling plan was developed, which determined the sample size needed in each stratum. The sample was then selected and the sample weights were developed.

Sampling Frame

Several sources were used to develop the list of CWSs, WWTFs, AI utilities, and ANV utilities. The list of CWSs is from the sampling frame developed for the seventh Drinking Water Infrastructure Needs Survey and Assessment (DWINSA). This frame is based on the list of active CWSs in the federal Safe Drinking Water Information System (SDWIS/fed), which was checked and updated by the state drinking water programs. For the COVID-19 survey, EPA divided the CWSs into five size categories based on the number of people served by each system:

• 500 or fewer people served
• 501 to 3,300 people served
• 3,301 to 10,000 people served
• 10,001 to 100,000 people served
• More than 100,000 people served

The inventory of WWTFs is from the 2012 Clean Water Infrastructure Needs Survey (CWNS) sampling frame. For the COVID-19 survey, EPA added facilities from South Carolina, which were not included in the CWNS frame. The list of South Carolina plants was extracted from EPA's Enforcement and Compliance History Online (ECHO) system. EPA removed AI and ANV facilities from the WWTF sampling frame because they were to be sampled separately. EPA divided the WWTFs into three categories based on the number of people served by each facility:

• 9,999 or fewer people served
• 10,000 to 99,999 people served
• 100,000 or more people served

In contrast to CWSs and WWTFs, the survey sampled American Indian and Alaska Native Village water and wastewater service providers by utility. Utilities may include more than one system or facility and may provide both drinking water and wastewater services. EPA used the list of utilities that was developed for a study of AI and ANV utility operations and maintenance expenses conducted by the Indian Health Service (IHS). AI and ANV utilities were divided among five size categories based on the number of people served:

• 100 or fewer people served
• 101 to 500 people served
• 501 to 3,300 people served
• 3,301 to 10,000 people served
• More than 10,000 people served (no ANV utilities serve more than 10,000 people)

For all four groups, EPA started with system or utility names and contact information (email addresses and telephone numbers) as available through publicly accessible databases from the original frames. After the sample was drawn, EPA reached out to state drinking water programs, EPA Regions, and EPA partners in Alaska to review the contact information of utilities in the sample and provide updated information when possible. EPA also called utilities and conducted Web searches to obtain missing email addresses.
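To illustrate how a frame record could be classified into one of the strata described above, the sketch below applies the type and population-served breakpoints from the bullet lists. It is a simplified illustration only: the function name, label strings, and example records are assumptions, the WWTF "size unknown" stratum is not represented, and EPA's actual frame processing was not done in Python.

```python
# Illustrative sketch: assign a frame record to a sampling stratum using the
# utility type and the population-served categories described in the text.
# Labels and function names are hypothetical, not EPA's actual code.

CWS_BREAKS = [(500, "CWS: 500 or fewer"), (3_300, "CWS: 501-3,300"),
              (10_000, "CWS: 3,301-10,000"), (100_000, "CWS: 10,001-100,000"),
              (float("inf"), "CWS: more than 100,000")]
WWTF_BREAKS = [(9_999, "WWTF: 9,999 or fewer"), (99_999, "WWTF: 10,000-99,999"),
               (float("inf"), "WWTF: 100,000 or more")]
TRIBAL_BREAKS = [(100, "100 or fewer"), (500, "101-500"), (3_300, "501-3,300"),
                 (10_000, "3,301-10,000"), (float("inf"), "more than 10,000")]

def assign_stratum(utility_type: str, population_served: int) -> str:
    """Return a stratum label for a frame record (illustrative only)."""
    breaks = {"CWS": CWS_BREAKS, "WWTF": WWTF_BREAKS,
              "AI": TRIBAL_BREAKS, "ANV": TRIBAL_BREAKS}[utility_type]
    for upper, label in breaks:
        if population_served <= upper:
            # AI and ANV share the same size categories, so prefix the type.
            return label if utility_type in ("CWS", "WWTF") else f"{utility_type}: {label}"
    raise ValueError("population_served must be a non-negative count")

print(assign_stratum("CWS", 4_200))   # CWS: 3,301-10,000
print(assign_stratum("ANV", 75))      # ANV: 100 or fewer
```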
Sample Design and Selection

For CWSs and WWTFs, the survey was designed to estimate a proportion with a 95 percent confidence interval of plus or minus 5 percentage points within each stratum. The width of a confidence interval depends on the variability of the data and the size of the sample. EPA assumed the variance of proportions estimated using the sample would be 0.25. (This is conservative because it is the largest possible variance for a proportion.) Given this assumption about the variance, EPA then determined the sample size needed to meet the precision target. The sample was distributed among the states within each stratum to ensure each state was included in the sample.

The survey was designed to estimate a proportion with a 95 percent confidence interval of plus or minus 5 percentage points for AI utilities in the aggregate and for ANV utilities in the aggregate, not by population size category. The precision targets were set in the aggregate because the small number of AI and ANV utilities would have required a relatively large sample to meet the more stringent precision targets. The estimates by stratum for AI and ANV utilities therefore may be less precise than the estimates by stratum for CWSs and WWTFs. The sample was allocated among the size categories to ensure utilities from each were included. AI utilities were distributed among the 10 EPA regions and the Navajo Nation.

A 95 percent confidence interval is an estimated range of values such that there is a 95 percent chance that the interval includes the true value. For example, a 95 percent confidence interval of 45%-55% (which could be expressed as 50% +/- 5%), calculated from the sample data, means that there is a 95 percent chance that the range of 45%-55% contains the true percentage for the population at large.

The sample was selected using Stata's sample function. EPA assumed the response rate would be 60 percent when it selected the sample and oversampled to account for non-response. The initial response rate was below 60 percent during the first several weeks of the data collection period; therefore, additional utilities were selected to increase the sample size. Table A.1, above, shows the size of the final inventory of utilities by stratum, the size of the sample selected, the number of final responses, and the response rate by stratum.
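The sketch below shows one way a per-stratum sample-size target of this kind can be computed: a conservative variance of 0.25, a plus or minus 5 percentage point target at the 95 percent confidence level, a finite population correction, and inflation for an assumed 60 percent response rate. The function and its defaults are illustrative assumptions; the result will not exactly reproduce the sample sizes in Table A.1, because the actual allocation also distributed the sample across states and was later supplemented when the early response rate fell below 60 percent.

```python
import math

def stratum_sample_size(N: int, moe: float = 0.05, z: float = 1.96,
                        p_var: float = 0.25, expected_response: float = 0.60) -> int:
    """Illustrative per-stratum sample size for estimating a proportion.

    Uses the conservative variance p(1-p) = 0.25, applies a finite population
    correction for a stratum of N utilities, and inflates the result to allow
    for non-response (an assumed 60 percent response rate).
    """
    n_infinite = (z ** 2) * p_var / (moe ** 2)        # about 384 before the FPC
    n_fpc = n_infinite / (1 + (n_infinite - 1) / N)   # finite population correction
    return math.ceil(n_fpc / expected_response)       # oversample for non-response

# Example: a stratum with 8,568 utilities in the frame (the CWS 3,301-10,000 count)
print(stratum_sample_size(8_568))
```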
Strata Migration

Errors in the frame classification of the utilities by population served potentially introduce inefficiency into the sample design through a loss of sample size and/or by introducing unequal sampling rates. Among the respondents, 94 percent reported the same population-served category as indicated by the frame. Table A.2 compares the classification of CWSs by their population served using the population data from the frame and from the systems' responses to the survey.

Table A.2. CWS Respondents by the Frame-Based and Sample-Based Size Categories
(Rows are the frame-based population-served categories; columns are the sample-based categories reported in the survey.)

Frame-Based Category | 500 or Less | 501-3,300 | 3,301-10,000 | 10,001-100,000 | Greater than 100,000 | Total
Less than 501 | 942 | 3 | 0 | 0 | 3 | 948
501-3,300 | 41 | 874 | 12 | 6 | 0 | 933
3,301-10,000 | 1 | 28 | 772 | 118 | 1 | 920
10,001-100,000 | 1 | 2 | 2 | 558 | 4 | 567
Greater than 100,000 | 0 | 1 | 1 | 61 | 516 | 579
Total | 985 | 908 | 787 | 743 | 524 | 3,947

Table A.3 compares the classification of WWTFs by their population served using the population data from the frame and from the facilities' responses to the survey.

Table A.3. WWTF Respondents by the Frame-Based and Sample-Based Size Categories
(Rows are the frame-based population-served categories; columns are the sample-based categories reported in the survey.)

Frame-Based Category | Less than 10,000 | 10,000-99,999 | 100,000 or more | Total
Less than 10,000 | 942 | 4 | 1 | 947
10,000-99,999 | 41 | 754 | 13 | 808
100,000 or more | 4 | 14 | 301 | 319
Total | 987 | 772 | 315 | 2,074

Table A.4 compares the classification of AI and ANV utilities by their population served using the population data from the frame and from the utilities' responses to the survey.

Table A.4. AI and ANV Utility Respondents by the Frame-Based and Sample-Based Size Categories
(Rows are the frame-based population-served categories; columns are the sample-based categories reported in the survey.)

Frame-Based Category | Less than 101 | 101-500 | 501-3,300 | 3,301-10,000 | Greater than 10,000 | Total

American Indian Utilities
Less than 101 | 71 | 2 | 7 | 0 | 2 | 82
101-500 | 2 | 65 | 6 | 1 | 1 | 75
501-3,300 | 1 | 0 | 73 | 2 | 2 | 78
3,301-10,000 | 0 | 1 | 7 | 25 | 3 | 36
Greater than 10,000 | 0 | 0 | 1 | 1 | 8 | 10
Total | 74 | 68 | 94 | 29 | 16 | 281

Alaska Native Village Utilities
Less than 101 | 57 | 3 | 1 | 0 | 0 | 61
101-500 | 3 | 77 | 9 | 1 | 0 | 90
501-3,300 | 0 | 2 | 25 | 1 | 0 | 28
3,301-10,000 | 0 | 0 | 0 | 0 | 0 | 0
Greater than 10,000 | 0 | 0 | 0 | 0 | 0 | 0
Total | 60 | 82 | 35 | 2 | 0 | 179

EPA analyzed the results of the survey using the population totals reported in the survey, not the original population totals from the sampling frame. EPA did not trim or otherwise adjust the weights because of the strata migration. (See section A.3.2 for a description of the derivation of the sampling weights.)

A.3.2. Weighting and Estimation

A sample weight is attached to each responding utility record to (1) account for differential selection probabilities by stratum, and (2) reduce the potential bias resulting from non-response. The sampling weights are necessary for estimation of the population characteristics of interest. The sample variance is then used to calculate 95 percent confidence intervals for the estimates.

Derivation of Base Weight and Non-response Adjustment

The sample consists of a stratified element sample of utilities. As described above, the utilities were stratified by the type of service provided and the size of the population served. Seventeen sampling strata were defined; all weights are calculated by stratum. The sampling weights are calculated in several steps.

Base weights

The first step was the calculation of the base sampling weight for each sampled utility. A simple random sample was selected within each stratum. Therefore, for all utilities the base sampling weight equals the number of utilities in the stratum divided by the number sampled from that stratum. In other words, the base weight for the hth stratum, B_h, is:

(1)  B_h = N_h / n_0h

where N_h represents the number of utilities in the stratum in the sampling frame, and n_0h represents the number of utilities sampled from the stratum.

Non-response adjustment

The second step in the weights calculation was to make a unit non-response adjustment to the base sampling weights.
For each stratum, the non-response adjustment factor is equal to the ratio of the number of utilities that completed the survey plus the number of non-respondents to the number of utilities that completed the survey (i.e., the reciprocal of the stratum response rate). Ineligible systems (for example, utilities that were determined to be inactive) are not incorporated into the unit non-response adjustment. The adjustment factor for the hth stratum, δ_h, is given by:

(2)  δ_h = (n_h + r_h) / n_h

where n_h is the number of utilities that completed the survey and r_h is the number of refusals and other non-respondents in the hth stratum.

Final weights

The non-response adjustment factor δ_h was multiplied by the base sampling weight, B_h, to obtain the non-response-adjusted base sampling weight. The non-response-adjusted weights can be written as:

(3)  W_h = B_h × δ_h

Item non-response adjustment

Only approximately one-third of the respondents answered all of the questions about actual and budgeted income and expenditures. The weights were further adjusted to account for this non-response. The adjustment is similar to equation (2), but the non-respondents include utilities that responded to the survey but skipped the specific question.
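A minimal sketch of the weight construction in equations (1) through (3) is shown below, using the CWS less-than-501 stratum from Table A.1 as a worked example. For simplicity it assumes the stratum contained no ineligible systems, so every sampled non-respondent counts toward r_h; that assumption, and the function and variable names, are illustrative only. The actual weights were computed in Stata.

```python
def final_weight(N_h: int, n0_h: int, n_h: int, r_h: int) -> float:
    """Non-response-adjusted sampling weight for stratum h.

    N_h  : utilities in the stratum in the sampling frame
    n0_h : utilities sampled from the stratum
    n_h  : sampled utilities that completed the survey
    r_h  : sampled utilities that refused or otherwise did not respond
    """
    base_weight = N_h / n0_h           # equation (1)
    adjustment = (n_h + r_h) / n_h     # equation (2), reciprocal of the response rate
    return base_weight * adjustment    # equation (3)

# CWS serving fewer than 501 people (Table A.1): 25,902 in the frame,
# 948 sampled, 206 respondents; the other 742 sampled systems are treated
# here as eligible non-respondents (purely illustrative).
w = final_weight(N_h=25_902, n0_h=948, n_h=206, r_h=948 - 206)
print(round(w, 1))   # each responding utility represents roughly 126 utilities
```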
Variance Estimation

The sampling design affects the standard error of the estimates. The stratification of the utilities by service provided and population served will tend to reduce the overall sample variance, as utilities within each stratum tend to be similar to each other and different from utilities in other strata. The variance is estimated using a first-order Taylor expansion. The variance estimator, which was calculated in Stata for this survey, is given by:

(4)  V(R) = (1/X^2) [ V(Y) - 2R Cov(Y, X) + R^2 V(X) ]

where R = Y/X is the ratio of estimates of two population totals, Y = Σ_h Σ_i Σ_j w_hij y_hij and X = Σ_h Σ_i Σ_j w_hij x_hij, summing over the L strata (indexed by h), the m_h primary sampling units in stratum h (indexed by i), and the n_hi elements in the ith primary sampling unit of stratum h (indexed by j).

Most of the estimates produced are either proportions or means. (The sample was designed to estimate proportions with a 95 percent confidence interval of plus or minus 5 percentage points. It was not designed to estimate means with a given level of precision, but the sample is used to estimate means and totals in some cases.) A mean is simply a ratio in which x_hij is equal to 1. A proportion is simply a mean in which y_hij is equal to a 0/1 variable.[1]

[1] See W.G. Cochran, Sampling Techniques (New York: John Wiley & Sons, 1977), for more information about variance estimates.

A finite population correction factor was applied because the sample was relatively large and was selected without replacement. The factor, f_h, is the ratio of the number of systems in the sample to the number of systems in the stratum, n_h/N_h. To estimate the variance, we first define the following ratio residual:

(5)  d_hij = (1/X) (y_hij - R x_hij)

We then define the weighted total of the ratio residual as:

(6)  z_dhi = Σ_j w_hij d_hij

and the weighted average of the residual as:

(7)  zbar_dh = (1/m_h) Σ_i z_dhi

We can then define the variance estimate as:

(8)  V(R) = Σ_h (1 - f_h) [ m_h / (m_h - 1) ] Σ_i (z_dhi - zbar_dh)^2

where f_h is the finite population correction for stratum h.

We use a logit transform of the proportion to calculate the confidence limits of each proportion. If p is the estimated proportion, the logit transform of the proportion is:

(9)  f(p) = ln( p / (1 - p) )

If s is the estimated standard error of p, derived from equation (8), the standard error of f(p) is given by:

(10)  SE[f(p)] = s / ( p(1 - p) )

The 95 percent confidence interval on the logit scale is thus:

(11)  ln( p / (1 - p) ) ± t_(1-0.025, v) × s / ( p(1 - p) )

where t_(1-0.025, v) is the critical value of the Student's t distribution that cuts off an upper-tail probability of 0.025, with v degrees of freedom. The endpoints are then transformed back to the proportion metric by using the inverse of the logit transform.
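As an illustration of the confidence-interval construction in equations (9) through (11), the sketch below computes a 95 percent interval for a proportion on the logit scale and transforms the endpoints back. The estimated proportion, standard error, and degrees of freedom used in the example are placeholder values; in the survey itself these quantities came from the design-based variance estimator above, computed in Stata.

```python
import math
from scipy import stats

def logit_ci(p: float, se: float, df: int, alpha: float = 0.05) -> tuple[float, float]:
    """95 percent CI for a proportion via the logit transform (equations 9-11)."""
    logit_p = math.log(p / (1 - p))             # equation (9)
    se_logit = se / (p * (1 - p))               # equation (10)
    t = stats.t.ppf(1 - alpha / 2, df)          # critical value with df degrees of freedom
    lo = logit_p - t * se_logit                 # equation (11), lower endpoint
    hi = logit_p + t * se_logit                 # equation (11), upper endpoint
    inv = lambda x: 1 / (1 + math.exp(-x))      # inverse logit (back-transform)
    return inv(lo), inv(hi)

# Placeholder values: an estimated proportion of 0.33 with a design-based
# standard error of 0.02 and 1,900 degrees of freedom (illustrative only).
print(logit_ci(0.33, 0.02, 1_900))   # roughly (0.29, 0.37)
```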
A.4. Survey Design and Response

The survey was administered through an on-line self-administered questionnaire. This section describes the survey instrument, the processes for distributing the questionnaires, the process for ensuring sufficient response rates, and the handling of returned questionnaires.

A.4.1. Questionnaire Design

Following the approval of the Emergency ICR by the Office of Management and Budget on June 6, 2020, an EPA workgroup composed of staff from the Water Security Division and Drinking Water Protection Division of the Office of Ground Water and Drinking Water and from the Office of Wastewater Management developed the questionnaire with the assistance of the Cadmus Group, LLC. The workgroup developed and refined a list of questions for each section, with attention to formatting and phrasing the questions to ensure they were clear and would collect the most useful information. Cadmus programmed the questionnaire in Qualtrics. The topics covered were:

A. Demographics (basic information about the respondent)
B. Supply chain issues
C. Workforce issues
D. Financial issues
E. Analytical support issues
F. Cybersecurity issues
G. General information and concerns about the future

Sections B-G of the survey included free-response questions, allowing respondents to explain how they developed their answers or provide any additional information they considered relevant.

Senior Cadmus water sector experts reviewed the questionnaire for terminology that could potentially confuse utility staff and to estimate the length of time that would be required to complete it. Representatives of several water sector associations reviewed the questionnaire as well. EPA conducted a pre-test of the survey with four utility volunteers in August 2020 to test the emailing of survey links and the online data collection process and to get additional feedback on the content and language of the survey. All four of the utility volunteers provided written input, and two participated in a debriefing conference call as well. EPA made minor changes to the data collection approach and questionnaire as a result of the internal and external review and pre-test. Further minor changes were made to the questionnaire following review by Agency senior management in September 2020.

A.4.2. Survey Administration

EPA and water sector partners conducted outreach in advance of the survey to inform water and wastewater utilities of the data collection effort and explain the purpose of the survey. The survey began with a pilot-test "soft launch" on October 1, 2020, to an initial batch of 100 utilities from the sample, to test the email and data collection systems and the technical support system. The full launch followed on October 6, 2020. A second round of emails was sent on October 29 to increase the total number of respondents. Most responses were received in October and November; some trailing responses were received in December. Data collection ceased on December 31, 2020.

Each survey participant received a unique link, along with instructions for responding, helpdesk information, an "opt out" link, and an invitation to contact EPA staff to verify the legitimacy of the survey. Each survey link was unique to enable survey tracking. EPA established a process to provide technical assistance and guidance to the survey respondents via a contractor-staffed helpdesk and a dedicated EPA email address. Follow-up email and phone contact was made with survey recipients to encourage them to respond. In some cases, state staff and other stakeholders also helped encourage survey participants to complete the survey. Helpdesk staff followed up periodically with each utility in the sample until the utility either responded to the survey or declined to participate. The helpdesk staff provided technical assistance as necessary, and in some cases helped those having technical trouble to fill out the questionnaire. During review of data logged in the Qualtrics system, if a problem or question arose, Cadmus staff contacted the utility to resolve it.

A.4.3. Survey Response and Evaluation of Possible Non-Response Bias

The data collection effort was closed out December 31, 2020. Of the 6,481 utilities sampled, 1,956 responded to the survey. The overall response rate was 30.2 percent.

Non-response may introduce a source of bias into estimates based on the survey if non-respondents differ from respondents in some relevant way. The potential effect of non-response bias on estimates of the issues facing utilities can be positive or negative. On the one hand, systems that are struggling due to the pandemic may be unable to respond to the survey. On the other hand, systems not reporting significant problems may not see the need to participate. Therefore, both the sign (whether the bias is positive or negative) and the magnitude of non-response bias are uncertain.

To evaluate the potential effect of the non-response bias, in mid-December of 2020 EPA contacted 37 utilities that were sampled but chose not to participate in the survey. EPA asked the utilities why they did not participate and whether they expected the pandemic to affect their operations in terms of supply chain, workforce, finances, analytic sampling, and cybersecurity. Of the 37 utilities successfully contacted, 10 reported that the COVID-19 pandemic was not an issue or concern. The remaining utilities indicated that they had declined to participate in the survey due to lack of time or resources. Thirty of the 37 utilities provided information about the potential effects of the pandemic. Their responses are shown in Table A.5.

Table A.5. Potential Effects of COVID-19 on Utilities that Did Not Respond to the Survey

Issue or Concern | Utilities Reporting that COVID-19 Does Not Raise this Concern | Utilities Reporting that COVID-19 Raises this Concern | Total
Supply Chain | 22 (73%) | 8 (27%) | 30 (100%)
Workforce | 19 (63%) | 11 (37%) | 30 (100%)
Finance | 21 (70%) | 9 (30%) | 30 (100%)
Analytic Support | 29 (97%) | 1 (3%) | 30 (100%)
Cybersecurity | 28 (93%) | 2 (7%) | 30 (100%)
Other | 25 (83%) | 5 (17%) | 30 (100%)
Any Issue | 15 (50%) | 15 (50%) | 30 (100%)

In general, the percentages of non-respondents reporting concerns are comparable to the survey findings.
Based on the responses to the survey, roughly one-third of utilities had a supply chain concern during 2020, compared to 27 percent among non-respondents. Approximately one-quarter of utilities faced workforce shortages, and 37 percent of non-respondents reported workforce issues. Approximately one-third of utilities had lower net revenue than they budgeted, which is consistent with the 30 percent of non-respondents that reported financial concerns. Approximately one-tenth of utilities had conditions that interfered with their ability to complete sampling and one-tenth had issues with laboratory analysis, compared to the 3 percent of non-respondents that had concerns in these areas. The contacted non-respondents were more likely to report cybersecurity concerns (7 percent) than the 1 percent of utilities that experienced cybersecurity issues in 2020. On the other hand, the survey findings show that approximately one-quarter of the utilities have concerns about cybersecurity in the future.

A.5. Quality Assurance

The survey was conducted in accordance with an approved Programmatic Quality Assurance Project Plan. Because of the short timeline of the emergency information collection, a Supplemental Quality Assurance Project Plan (SQAPP) was not developed specifically for this survey; EPA followed the guidelines and procedures established in the DWINSA QAPP. EPA enacted specific measures to check and ensure the validity of the survey data from data collection through data processing and analysis, as well as measures to assure the quality of other survey components.

A.5.1. Draft Questionnaire Pre-Test and Survey Pilot Test

A significant component of the survey quality assurance plan was to thoroughly test the questionnaire design, the survey design, and the data collection procedures prior to implementing the full study. Efforts to confirm the validity and effectiveness of these designs, and to revise them when the tests revealed problems, errors, or difficulties, led to design and process improvements in such areas as data reliability, data completeness, accuracy of the sample frame, and response rates.

Pre-test

When EPA had identified the initial data collection objectives and completed a working draft of the questionnaire, EPA shared the draft with two groups of water sector experts: three contractor staff members and representatives of several water association partners and stakeholders. Improvements were made to the survey instrument based on the feedback received, and Cadmus staff provided an estimate of the time it would take utilities to complete the survey.

EPA then conducted a formal pre-test of the draft survey with four utility volunteers. The main objectives of the pre-test were to evaluate the clarity of the questions and to test some of the Qualtrics system's functionality for emailing links. The four utility volunteers reviewed the questionnaire in August 2020. Each provided written feedback, responding to a set of questions. The volunteers explored questions regarding comprehensibility, use of clear and appropriate terminology, provision of suitable response categories, and questionnaire layout. The reviewers also evaluated the ease or difficulty of providing answers, their immediate knowledge of or access to information requested by the questionnaire, and their overall reaction to the survey. Two of the volunteers not only provided written feedback but also participated in a debrief via conference call.
Overall, the volunteers thought the questionnaire was clear and relatively easy to follow. As a result of the pre-test, some questions were re-worded. Otherwise, the pre-test found no systematic problems in respondents' ability to provide answers to the questions.

Pilot Test

The survey began with a "soft launch," which served as a full-scale pilot test of the survey instrument and data collection procedures. The test was conducted in October 2020. The soft launch provided an opportunity to test survey distribution, survey support, data systems, and logging of survey results on a limited number of participants before the full launch. One hundred utilities from the survey sample (40 CWSs, 40 WWTFs, and 20 AI utilities) were selected from the full sample for use in the pilot. Based on the experience of the soft launch, modest changes were made to the emails and the instructions for utilities for the full launch.

A.5.2. Sampling Quality Assurance

Quality assurance of the sampling process for the survey involved three principal areas:

1. Development of the sample frame
2. Sampling specifications
3. Use of software designed to draw samples

Development of the Sample Frame

EPA conducted an extensive review of the data used for the sample frame. For CWSs, the survey was able to take advantage of the extensive data verification effort undertaken for the 7th DWINSA. The DWINSA frame was developed with SDWIS data from the second quarter of 2019. State representatives working on the DWINSA reviewed their respective lists of systems from the data freeze and made changes to population and source categories as needed. The sample frame was then built using the data from the states.

The 2012 CWNS was the basis for the WWTF sampling frame. For the COVID-19 survey, EPA supplemented the CWNS data with information from ECHO to fill in gaps.

The sampling frame for the AI and ANV utilities used the inventory of utilities developed in 2016-2018 for IHS evaluations of the operation and maintenance costs of tribally owned and operated American Indian and Alaska Native Village drinking water and wastewater utilities. The IHS inventory was based on the list of systems and wastewater facilities in SDWIS and in the IHS Operation and Maintenance Data System (OMDS). For the IHS project, the AI utility inventory was reviewed and verified by federal EPA and IHS staff, and the ANV utility inventory was reviewed and verified by EPA Region 10 and the state of Alaska. The COVID-19 survey inventory of AI and ANV utilities was reviewed and verified by the federal and state staff.

Sampling Specifications

To carry out the sampling processes, the survey statisticians prepared written sampling plans that served as directions for performing the sampling and as permanent documentation of the process. The specifications ensured the sample was drawn in conformity with the sample design and in a statistically valid manner.

Sampling Software

The sample of utilities was drawn using a Stata program designed to draw stratified random samples. EPA developed programs to draw the samples to ensure they could be replicated.
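The sketch below illustrates the general idea of a replicable stratified draw: fix a random seed, then select a simple random sample of the required size within each stratum. It is a simplified stand-in written in Python rather than Stata; the data-frame columns, the miniature example frame, and the seed value are assumptions for illustration, not EPA's actual sampling program.

```python
import pandas as pd

def draw_stratified_sample(frame: pd.DataFrame, sizes: dict[str, int],
                           seed: int = 20200701) -> pd.DataFrame:
    """Draw a simple random sample within each stratum of the frame.

    frame : one row per utility, with a 'stratum' column
    sizes : required sample size for each stratum
    seed  : fixed so the draw can be replicated exactly
    """
    draws = [
        group.sample(n=min(sizes[name], len(group)), random_state=seed)
        for name, group in frame.groupby("stratum")
    ]
    return pd.concat(draws, ignore_index=True)

# Hypothetical miniature frame, for illustration only.
frame = pd.DataFrame({
    "utility_id": range(1, 11),
    "stratum": ["CWS: 500 or fewer"] * 6 + ["WWTF: 9,999 or fewer"] * 4,
})
sample = draw_stratified_sample(frame, {"CWS: 500 or fewer": 3,
                                        "WWTF: 9,999 or fewer": 2})
print(sample)
```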
A.5.3. Data Collection Quality Assurance

Questionnaire Design

The various drafts of the questionnaires were the product of close review and comment by EPA, its contractor, internal reviewers, and external stakeholder reviewers. Improvements also were made as a result of the pre-test and pilot test. Questionnaire version control was maintained through the various drafts by allowing for one master copy and strictly enforcing version-control procedures. After changes were made to each version of the questionnaire, a new electronic file with the date of the changes was created. Cadmus was responsible for making all changes and could track changes over previous versions of the questionnaire. The questionnaire form was iteratively reviewed and improved to make it clear and simple to use. The on-line survey design incorporated skip patterns so respondents would answer only the appropriate questions.

Email Data Collection

Analysts with extensive experience administering on-line surveys were responsible for the mailout. They were provided with clear specifications by a senior staff member. The Qualtrics system permitted the introductory email to be tailored to each utility, including providing each participant with a unique survey link. Each utility was assigned to one of 12 helpdesks, and each helpdesk was staffed by an analyst who maintained contact with the helpdesk's utilities throughout the survey. The analysts provided reminder calls and emails and technical support to the survey participants. The work of the analysts was overseen by a back-up team and senior survey managers, and when necessary the team elevated questions to EPA.

When survey support staff learned that a survey had been received by someone other than the intended utility recipient, they thanked the informant and instructed them not to complete the survey, and they made phone calls or conducted Internet research to find better contact information. The revised contact information was then entered into the project's data files. The affected surveys were closed down, and fresh links were emailed to the newly acquired email addresses.

The tracking system ensured proper tracking and control of all questionnaires from the initial emailing of links until the data were formally submitted. Periodic status reports from the Qualtrics system supported overall management of the project and also helped to identify response rate problem areas, which enabled EPA to take appropriate follow-up measures.

A.5.4. Data Processing Quality Assurance

Qualtrics Data Extraction

The completed survey results were extracted from the Qualtrics database and reviewed. Entries that needed to be removed (because they were from "test" surveys, because they were duplicate entries from a single utility, or because someone began filling out a survey before realizing it was received in error) were struck from the data set. The electronic data were then transformed into a hierarchical database for detailed analysis.
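A simplified sketch of this kind of extraction cleanup is shown below: drop records flagged as tests, drop entries received in error, and keep a single record per utility when duplicates exist. The column names and the rule of keeping the most recently submitted duplicate are assumptions for illustration, not a description of the actual extraction procedures.

```python
import pandas as pd

def clean_extract(raw: pd.DataFrame) -> pd.DataFrame:
    """Remove test, erroneous, and duplicate records from a raw survey export.

    Assumed columns: 'utility_id', 'is_test', 'received_in_error', 'submitted_at'.
    """
    cleaned = raw[~raw["is_test"] & ~raw["received_in_error"]]
    # Keep one record per utility; here the latest submission is assumed to win.
    cleaned = (cleaned.sort_values("submitted_at")
                      .drop_duplicates(subset="utility_id", keep="last"))
    return cleaned.reset_index(drop=True)
```

In practice, as noted above, questionable duplicates were resolved by contacting the utility rather than by a fixed rule.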
Procedures were in place to maintain the integrity and quality of the data, as described below.

Automated Data Validation Checks

In preparing the final database, EPA ran a series of computer validation checks. These checks were run on the full survey database after the data were entered and had passed the standard computer edits for values and ranges on a variable-by-variable basis. The checks included the following:

1. Distribution frequencies for all categorical variables. The vast majority of the responses were categorical: either yes/no responses or choices from a short list of possible responses.

2. Distribution frequencies for all the continuous numerical variables (budgeted and actual revenue and expenses), formatted into four categories: non-zero responses, zero responses, legitimately skipped, and missing.

3. Univariate statistics for each continuous variable.

For the financial data reported in section D of the survey, EPA reviewed the distribution of responses to identify potential outliers and survey response errors. For example, by exploring the distribution of responses by stratum, analysts were able to identify cases where amounts were entered in millions of dollars rather than dollars. EPA made the corrections in the version of the data used in the analysis; EPA did not change the responses in the original dataset.
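To illustrate the kind of automated check described above for the financial variables, the sketch below flags responses that are extreme relative to other utilities in the same stratum, which is one way values entered in millions of dollars rather than dollars might surface for manual review. The column names and threshold are assumptions for illustration; EPA's actual review relied on distributional summaries by stratum and analyst judgment.

```python
import pandas as pd

def flag_financial_outliers(df: pd.DataFrame, value_col: str,
                            ratio_threshold: float = 100.0) -> pd.DataFrame:
    """Flag values far above the stratum median (possible unit errors).

    Assumed columns: 'stratum' and the numeric column named by value_col.
    A response more than `ratio_threshold` times its stratum median is flagged
    for manual review; nothing is changed automatically.
    """
    medians = df.groupby("stratum")[value_col].transform("median")
    flagged = df.copy()
    flagged["possible_unit_error"] = df[value_col] > ratio_threshold * medians
    return flagged
```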
Database Quality Assurance

The final, clean survey database represented the product of the various review, editing, data entry, and data validation steps described above. Once the database was prepared, a number of subsequent data processing steps were required to create the set of files used in the analyses and tabulations for the report. The principal steps included:

1. Appending needed variables from external files, including sample and contact information from the sampling frame.
2. Analyzing the questionnaires and the frequency distributions of continuous and categorical variables to devise rules for handling missing data.
3. Zero-filling blank responses. A detailed series of rules was developed for assessing blank responses and determining whether to regard them as zeros or missing values. A detailed set of programming specifications was designed to implement these rules.
4. Creating new derived variables from the survey data to categorize systems into strata comparable to the original sampling strata but based on the final survey responses about the size of the population served, rather than the sampling frame's data.
5. Recoding "Other" responses on multiple-choice questions in cases where the written-in response was one of the multiple-choice options already available.
6. Reviewing free-response answers to see whether any information provided by respondents about how they answered specific questions would require recoding.
7. Attaching the sample weights and finite population correction to the analytical file.

Each step was planned in advance. Early drafts of tables summarizing the survey responses were provided to EPA for review. The results were independently reviewed and verified.

Version control was maintained for all custom computer code that EPA developed, and interim stages of all data files were archived. This meant that when changes were made to a program or process, it was clear which version was current, and the sequential changes that had been made from one version to the next were apparent. It was always possible to restore any earlier version in full or to merge selected data from the old version into the new version. The combination of the processing specifications, version control, and data archiving ensured that no process was irreversible, that it was always possible to recover from any deliberate or inadvertent changes to the data, and that the characteristics of the survey data were fully known at each processing stage.

Tabulation Quality Assurance

EPA produced the detailed summaries of each question by stratum (see Appendix C). The following steps were taken to help ensure that each table accurately summarized and presented the data contained in the final survey database:

1) Identify important, relevant, and useful information that could be developed from analyses of the survey data.
2) Design each table to effectively present the results.
3) Clearly describe the contents of each table.
4) Define in detail the variables, values, formulas, and derivations that go into each calculation.
5) Prepare clear and detailed data processing specifications for carrying out the tabulations according to the calculation definitions.
6) Develop computer programs to process the data pursuant to the tabulation specifications.
7) Review the initial output for:
   a) Consistency with the design of the table
   b) Conformity with the definitional and programming specifications
   c) Reasonable agreement with expected values, based on external measures and expert knowledge
8) Review definitions, specifications, programs, and underlying data for tabulations exhibiting data anomalies or outliers.
9) Revise definitions, specifications, or programs if the review process identifies errors or the need for modifications to previous decisions.
10) Repeat the previous tabulation quality assurance steps and re-run the tabulations until no further unacceptable data anomalies are found.

The tabulation process was fully automated, from the underlying source data through all processing stages to the final formatted tables. There were no intermediate stages requiring manual transfer or entry of data from one stage to the next. This eliminated human transcription error. Of equal importance, it also expedited the process of successive iterations of the tabulations during the quality review process: each time a table was produced, the output data were automatically transferred into the same final table form as in the previous iteration. This ensured that any new anomalies identified in later iterations did not result from transcription errors and allowed the review staff to focus their investigations on the table data, specifications, and programs.

In preparing the data tables for Appendix C, EPA redacted fields that contain information that could be used to identify utility participants. These included fields with utility names and identification numbers, fields with contact information, fields generated by Qualtrics with geolocation data, and all free-text response fields. Fields with high-level geographic information, such as state or territory, were retained. Before redacting the fields with free-response answers, EPA reviewed them for illustrative quotations to use in the summary report.

A.5.5. Quality Assurance During Report Preparation

The findings presented in the report are based on the data tables presented in Appendix C and other Stata outputs. The analyses were conducted in the statistical package Stata using a series of programs (called "do files"). These programs were reviewed by at least two analysts, and all changes were tracked and documented. Decisions to exclude outliers or other data from analyses were documented. Findings presented in the report, including quotations from free-response answers, were reviewed by Cadmus and EPA staff, and revisions to the report were tracked.