EPA 2020 COVID-19 Water Sector Survey

Appendix A: Survey Methodology

A.1. Summary

In 2020, the U.S. Environmental Protection Agency (EPA) surveyed the water sector to learn about the effects of the COVID-19 pandemic on supply chain, workforce, finance, sample collection and analysis, and cybersecurity issues. It would have been prohibitively expensive to collect this information from every water and wastewater utility; therefore, the Agency selected a random sample to participate in the survey. Planning and design of the survey began in June of 2020. The sample and questionnaire were developed from July through September of 2020. Data collection took place from October through December 2020. Data processing and analysis continued through February 2021.

The information was collected through a self-administered electronic questionnaire. The responses were compiled in an electronic database, and weighted estimates were developed to estimate the effect of COVID-19 on the sector as a whole and on subpopulations of utilities. The estimates include 95 percent confidence intervals to characterize the uncertainty introduced by sampling. The estimates and confidence intervals take into account the sampling design. This appendix describes in detail the approach used to conduct the survey and to analyze the results. The key features of the approach are:

• The estimates are based on data provided by a stratified random sample of water utilities. The approximately 61,000 utilities in the nation consist of approximately 48,000 community water systems (CWSs) providing drinking water, 12,000 wastewater treatment facilities (WWTFs), 300 American Indian (AI) utilities, and 180 Alaska Native Village (ANV) utilities. AI and ANV utilities may provide drinking water services only, wastewater services only, or both drinking water and wastewater services. The approximately 95,000 non-community water systems were not included in the survey. For the purposes of this report, all four groups (CWSs, WWTFs, AI utilities, and ANV utilities) are collectively referred to as "utilities." Utilities were further divided into size categories based on the number of people they serve, for a total of 17 mutually exclusive and exhaustive strata. The number of utilities in the inventory, the size of the sample, the number of respondents, and the response rate are shown by stratum in Table A.1.

• The survey was designed to estimate the proportion of utilities facing issues related to the COVID-19 pandemic in each stratum with a 95 percent confidence interval of plus or minus 5 percentage points. EPA set the targets by stratum for CWSs and WWTFs. For AI and ANV utilities, EPA set the precision target for each group of utilities as a whole, not by population size category. EPA oversampled to account for expected non-response.

• An EPA workgroup of staff from the Office of Ground Water and Drinking Water and the Office of Wastewater Management developed the questionnaire. The workgroup developed and refined a list of questions for each section, with attention to formatting and phrasing the questions to ensure they were clear and would collect the most useful information.

• Utilities were sent an email with a link to the electronic questionnaire, which they filled out and submitted. The questionnaire was coded and distributed to the sampled utilities using Qualtrics software, which also was used to track their responses and progress. Approximately 30 percent of the sample responded to the survey.
• The sample was used to characterize the effect of COVID-19 on the water sector. EPA developed sample weights that reflect the selection probabilities of each utility and the response rates by stratum. EPA also evaluated the potential error introduced by survey non-response by calling a subset of the utilities that did not respond to the survey and determined that the potential bias is small.

• EPA implemented a thorough quality assurance process for the survey. The process included automated validation checks; strict version control of the questionnaires, databases, programs, and reports; and procedures to back up and secure the data.

Table A.1. Inventory and Sample of Water and Wastewater Utilities for the 2020 COVID-19 Water Sector Survey

Sampling Stratum (Type of Service and Size of Population Served) | Inventory of Utilities (Sampling Frame) | Sample Selected | Survey Respondents | Response Rate
CWS, Less than 501 | 25,902 | 948 | 206 | 21.7%
CWS, 501-3,300 | 12,623 | 933 | 249 | 26.7%
CWS, 3,301-10,000 | 8,568 | 920 | 304 | 33.0%
CWS, 10,001-100,000 | 670 | 567 | 199 | 35.1%
CWS, Greater than 100,000 | 707 | 579 | 211 | 36.4%
Subtotal, CWS | 48,470 | 3,947 | 1,169 | 29.6%
WWTF, Less than 10,000 | 9,684 | 925 | 258 | 27.7%
WWTF, 10,000-99,999 | 2,015 | 808 | 297 | 36.8%
WWTF, 100,000 or more | 354 | 319 | 114 | 35.7%
WWTF, Size Unknown | 22 | 22 | 4 | 18.2%
Subtotal, WWTF | 12,075 | 2,074 | 673 | 32.4%
AI, Less than 101 | 100 | 82 | 17 | 20.7%
AI, 101-500 | 87 | 75 | 16 | 21.3%
AI, 501-3,300 | 98 | 78 | 19 | 24.4%
AI, 3,301-10,000 | 36 | 36 | 16 | 44.4%
AI, Greater than 10,000 | 10 | 10 | 3 | 30.0%
Subtotal, AI Utilities | 331 | 281 | 71 | 25.3%
ANV, Less than 101 | 61 | 61 | 13 | 21.3%
ANV, 101-500 | 90 | 90 | 22 | 24.4%
ANV, 501-3,300 | 28 | 28 | 8 | 28.6%
Subtotal, ANV Utilities | 179 | 179 | 43 | 24.0%
Total | 61,055 | 6,481 | 1,956 | 30.2%

Additional detail on the methodology for the survey is provided in this appendix. The approach is described in the following sections:

A.2. Study Background and Timeline: provides an overview and background of the survey approach
A.3. Sampling Design and Weighting: explains how the sample was selected and weights were developed
A.4. Survey Design and Response: describes how the questionnaire was developed, how the survey was administered, and the survey response
A.5. Quality Assurance: describes the survey's quality assurance procedures

A.2. Study Background and Timeline

EPA developed the survey to collect information about COVID-19-related needs in the water sector. The survey was designed to collect information about potential supply chain, workforce, financial, analytical support, and cybersecurity challenges from CWSs, WWTFs, AI utilities, and ANV utilities. In this report, those four groups are referred to collectively as "utilities." The data were collected through an on-line survey form using Qualtrics software.

Planning and design of the survey began in June of 2020. An Emergency Information Collection Request (Emergency ICR) was approved by the Office of Management and Budget on June 6, 2020. The questionnaire was developed by an EPA workgroup with representatives from the Water Security Division and Drinking Water Protection Division of the Office of Ground Water and Drinking Water and representatives from the Office of Wastewater Management. The draft questionnaire was reviewed by contractor water sector experts and representatives of water sector associations and revised based on their input. In addition, a pre-test of the questionnaire was conducted in September of 2020, and a pilot test of the questionnaire and data collection approach was conducted in October of 2020.
Full-scale data collection from a stratified random sample of utilities took place from October through December 2020. Data processing and analysis continued through February 2021.

A.3. Sampling Design and Weighting

A.3.1. Sample Design

The survey relied on a probability sample of utilities. The survey used a stratified random sampling design to ensure the sample was representative of the range of utilities in the country (including all 50 states, the District of Columbia, Puerto Rico, and the territories). The first step in the process was to create an inventory of utilities from which to draw the sample, also known as the sampling frame. The utilities in the frame were divided into seventeen mutually exclusive and exhaustive strata based on the utility type (CWS, WWTF, AI utility, or ANV utility) and the size of the population served by the utility. A sampling plan was developed, which determined the sample size needed in each stratum. The sample was then selected and the sample weights were developed.

Sampling Frame

Several sources were used to develop the list of CWSs, WWTFs, AI utilities, and ANV utilities. The list of CWSs is from the sampling frame developed for the seventh Drinking Water Infrastructure Needs Survey and Assessment (DWINSA). This frame is based on the list of active CWSs in the federal Safe Drinking Water Information System (SDWIS/fed), which was checked and updated by the state drinking water programs. For the COVID-19 survey, EPA divided the CWSs into five size categories based on the number of people served by each system:

• 500 or fewer people served
• 501 to 3,300 people served
• 3,301 to 10,000 people served
• 10,001 to 100,000 people served
• More than 100,000 people served

The inventory of WWTFs is from the 2012 Clean Water Infrastructure Needs Survey (CWNS) sampling frame. For the COVID-19 survey, EPA added facilities from South Carolina, which were not included in the CWNS frame. The list of South Carolina plants was extracted from EPA's Enforcement and Compliance History Online (ECHO) system. EPA removed AI and ANV facilities from the WWTF sampling frame because they were to be sampled separately. EPA divided the WWTFs into three categories based on the number of people served by each facility:

• 9,999 or fewer people served
• 10,000 to 99,999 people served
• 100,000 or more people served

In contrast to CWSs and WWTFs, the survey sampled American Indian and Alaska Native Village water and wastewater service providers by utility. Utilities may include more than one system or facility and may provide both drinking water and wastewater services. EPA used the list of utilities that was developed for a study of AI and ANV utility operations and maintenance expenses conducted by the Indian Health Service (IHS). AI and ANV utilities were divided among five size categories based on the number of people served:

• 100 or fewer people served
• 101 to 500 people served
• 501 to 3,300 people served
• 3,301 to 10,000 people served
• More than 10,000 people served (no ANV utilities serve more than 10,000 people)

For all four groups, EPA started with system or utility names and contact information (email addresses and telephone numbers) as available through publicly accessible databases from the original frames. After the sample was drawn, EPA reached out to state drinking water programs, EPA Regions, and EPA partners in Alaska to review the contact information of utilities in the sample and provide updated information when possible. EPA also called utilities and conducted Web searches to obtain missing email addresses.
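To illustrate how a frame record could be classified into one of the strata described above, the sketch below applies the type and population-served breakpoints from the bullet lists. It is a simplified illustration only: the function name, label strings, and example records are assumptions, the WWTF "size unknown" stratum is not represented, and EPA's actual frame processing was not done in Python.

```python
# Illustrative sketch: assign a frame record to a sampling stratum using the
# utility type and the population-served categories described in the text.
# Labels and function names are hypothetical, not EPA's actual code.

CWS_BREAKS = [(500, "CWS: 500 or fewer"), (3_300, "CWS: 501-3,300"),
              (10_000, "CWS: 3,301-10,000"), (100_000, "CWS: 10,001-100,000"),
              (float("inf"), "CWS: more than 100,000")]
WWTF_BREAKS = [(9_999, "WWTF: 9,999 or fewer"), (99_999, "WWTF: 10,000-99,999"),
               (float("inf"), "WWTF: 100,000 or more")]
TRIBAL_BREAKS = [(100, "100 or fewer"), (500, "101-500"), (3_300, "501-3,300"),
                 (10_000, "3,301-10,000"), (float("inf"), "more than 10,000")]

def assign_stratum(utility_type: str, population_served: int) -> str:
    """Return a stratum label for a frame record (illustrative only)."""
    breaks = {"CWS": CWS_BREAKS, "WWTF": WWTF_BREAKS,
              "AI": TRIBAL_BREAKS, "ANV": TRIBAL_BREAKS}[utility_type]
    for upper, label in breaks:
        if population_served <= upper:
            # AI and ANV share the same size categories, so prefix the type.
            return label if utility_type in ("CWS", "WWTF") else f"{utility_type}: {label}"
    raise ValueError("population_served must be a non-negative count")

print(assign_stratum("CWS", 4_200))   # CWS: 3,301-10,000
print(assign_stratum("ANV", 75))      # ANV: 100 or fewer
```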
Sample Design and Selection

For CWSs and WWTFs, the survey was designed to estimate a proportion with a 95 percent confidence interval of plus or minus 5 percentage points within each stratum. The width of a confidence interval depends on the variability of the data and the size of the sample. EPA assumed the variance of proportions estimated using the sample would be 0.25. (This is conservative because it is the largest possible variance for a proportion.) Given this assumption about the variance, EPA then determined the sample size needed to meet the precision target. The sample was distributed among the states within each stratum to ensure each state was included in the sample.

The survey was designed to estimate a proportion with a 95 percent confidence interval of plus or minus 5 percentage points for AI utilities in the aggregate and for ANV utilities in the aggregate, not by population size category. The precision targets were set in the aggregate because the small number of AI and ANV utilities would have required a relatively large sample to meet the more stringent precision targets. The estimates by stratum for AI and ANV utilities therefore may be less precise than the estimates by stratum for CWSs and WWTFs. The sample was allocated among the size categories to ensure utilities from each were included. AI utilities were distributed among the 10 EPA regions and the Navajo Nation.

A 95 percent confidence interval is an estimated range of values such that there is a 95 percent chance that the interval includes the true value. For example, a 95 percent confidence interval of 45%-55% (which could be expressed as 50% +/- 5%), calculated from the sample data, means that there is a 95 percent chance that the range of 45%-55% contains the true percentage for the population at large.

The sample was selected using Stata's sample function. EPA assumed the response rate would be 60 percent when it selected the sample and oversampled to account for non-response. The initial response rate was below 60 percent during the first several weeks of the data collection period; therefore, additional utilities were selected to increase the sample size. Table A.1, above, shows the size of the final inventory of utilities by stratum, the size of the sample selected, the number of final responses, and the response rate by stratum.
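The sketch below shows one way a per-stratum sample-size target of this kind can be computed: a conservative variance of 0.25, a plus or minus 5 percentage point target at the 95 percent confidence level, a finite population correction, and inflation for an assumed 60 percent response rate. The function and its defaults are illustrative assumptions; the result will not exactly reproduce the sample sizes in Table A.1, because the actual allocation also distributed the sample across states and was later supplemented when the early response rate fell below 60 percent.

```python
import math

def stratum_sample_size(N: int, moe: float = 0.05, z: float = 1.96,
                        p_var: float = 0.25, expected_response: float = 0.60) -> int:
    """Illustrative per-stratum sample size for estimating a proportion.

    Uses the conservative variance p(1-p) = 0.25, applies a finite population
    correction for a stratum of N utilities, and inflates the result to allow
    for non-response (an assumed 60 percent response rate).
    """
    n_infinite = (z ** 2) * p_var / (moe ** 2)        # about 384 before the FPC
    n_fpc = n_infinite / (1 + (n_infinite - 1) / N)   # finite population correction
    return math.ceil(n_fpc / expected_response)       # oversample for non-response

# Example: a stratum with 8,568 utilities in the frame (the CWS 3,301-10,000 count)
print(stratum_sample_size(8_568))
```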
Strata Migration

Errors in the frame classification of the utilities by population served potentially introduce inefficiency into the sample design through a loss of sample size and/or by introducing unequal sampling rates. Among the respondents, 94 percent reported the same population-served category as indicated by the frame. Table A.2 compares the classification of CWSs by their population served using the population data from the frame and from the systems' responses to the survey.

Table A.2. CWS Respondents by the Frame-Based and Sample-Based Size Categories
(Rows are the frame-based population-served categories; columns are the sample-based categories reported in the survey.)

Frame-Based Category | 500 or Less | 501-3,300 | 3,301-10,000 | 10,001-100,000 | Greater than 100,000 | Total
Less than 501 | 942 | 3 | 0 | 0 | 3 | 948
501-3,300 | 41 | 874 | 12 | 6 | 0 | 933
3,301-10,000 | 1 | 28 | 772 | 118 | 1 | 920
10,001-100,000 | 1 | 2 | 2 | 558 | 4 | 567
Greater than 100,000 | 0 | 1 | 1 | 61 | 516 | 579
Total | 985 | 908 | 787 | 743 | 524 | 3,947

Table A.3 compares the classification of WWTFs by their population served using the population data from the frame and from the facilities' responses to the survey.

Table A.3. WWTF Respondents by the Frame-Based and Sample-Based Size Categories
(Rows are the frame-based population-served categories; columns are the sample-based categories reported in the survey.)

Frame-Based Category | Less than 10,000 | 10,000-99,999 | 100,000 or more | Total
Less than 10,000 | 942 | 4 | 1 | 947
10,000-99,999 | 41 | 754 | 13 | 808
100,000 or more | 4 | 14 | 301 | 319
Total | 987 | 772 | 315 | 2,074

Table A.4 compares the classification of AI and ANV utilities by their population served using the population data from the frame and from the utilities' responses to the survey.

Table A.4. AI and ANV Utility Respondents by the Frame-Based and Sample-Based Size Categories
(Rows are the frame-based population-served categories; columns are the sample-based categories reported in the survey.)

Frame-Based Category | Less than 101 | 101-500 | 501-3,300 | 3,301-10,000 | Greater than 10,000 | Total

American Indian Utilities
Less than 101 | 71 | 2 | 7 | 0 | 2 | 82
101-500 | 2 | 65 | 6 | 1 | 1 | 75
501-3,300 | 1 | 0 | 73 | 2 | 2 | 78
3,301-10,000 | 0 | 1 | 7 | 25 | 3 | 36
Greater than 10,000 | 0 | 0 | 1 | 1 | 8 | 10
Total | 74 | 68 | 94 | 29 | 16 | 281

Alaska Native Village Utilities
Less than 101 | 57 | 3 | 1 | 0 | 0 | 61
101-500 | 3 | 77 | 9 | 1 | 0 | 90
501-3,300 | 0 | 2 | 25 | 1 | 0 | 28
3,301-10,000 | 0 | 0 | 0 | 0 | 0 | 0
Greater than 10,000 | 0 | 0 | 0 | 0 | 0 | 0
Total | 60 | 82 | 35 | 2 | 0 | 179

EPA analyzed the results of the survey using the population totals reported in the survey, not the original population totals from the sampling frame. EPA did not trim or otherwise adjust the weights because of the strata migration. (See section A.3.2 for a description of the derivation of the sampling weights.)

A.3.2. Weighting and Estimation

A sample weight is attached to each responding utility record to (1) account for differential selection probabilities by stratum, and (2) reduce the potential bias resulting from non-response. The sampling weights are necessary for estimation of the population characteristics of interest. The sample variance is then used to calculate 95 percent confidence intervals for the estimates.

Derivation of Base Weight and Non-response Adjustment

The sample consists of a stratified element sample of utilities. As described above, the utilities were stratified by the type of service provided and the size of the population served. Seventeen sampling strata were defined; all weights are calculated by stratum. The sampling weights are calculated in several steps.

Base weights

The first step was the calculation of the base sampling weight for each sampled utility. A simple random sample was selected within each stratum. Therefore, for all utilities the base sampling weight equals the number of utilities in the stratum divided by the number sampled from that stratum. In other words, the base weight for the hth stratum, B_h, is:

(1)  B_h = N_h / n_0h

where N_h represents the number of utilities in the stratum in the sampling frame, and n_0h represents the number of utilities sampled from the stratum.

Non-response adjustment

The second step in the weights calculation was to make a unit non-response adjustment to the base sampling weights.
For each stratum, the non-response adjustment factor is equal to the ratio of the number of utilities that completed the survey plus the number of non-respondents to the number of utilities that completed the survey (i.e., the reciprocal of the stratum response rate). Ineligible systems (for example, utilities that were determined to be inactive) are not incorporated into the unit non-response adjustment. The adjustment factor for the hth stratum, δ_h, is given by:

(2)  δ_h = (n_h + r_h) / n_h

where n_h is the number of utilities that completed the survey and r_h is the number of refusals and other non-respondents in the hth stratum.

Final weights

The non-response adjustment factor δ_h was multiplied by the base sampling weight, B_h, to obtain the non-response-adjusted base sampling weight. The non-response-adjusted weights can be written as:

(3)  W_h = B_h × δ_h

Item non-response adjustment

Only approximately one-third of the respondents answered all of the questions about actual and budgeted income and expenditures. The weights were further adjusted to account for this non-response. The adjustment is similar to equation (2), but the non-respondents include utilities that responded to the survey but skipped the specific question.
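A minimal sketch of the weight construction in equations (1) through (3) is shown below, using the CWS less-than-501 stratum from Table A.1 as a worked example. For simplicity it assumes the stratum contained no ineligible systems, so every sampled non-respondent counts toward r_h; that assumption, and the function and variable names, are illustrative only. The actual weights were computed in Stata.

```python
def final_weight(N_h: int, n0_h: int, n_h: int, r_h: int) -> float:
    """Non-response-adjusted sampling weight for stratum h.

    N_h  : utilities in the stratum in the sampling frame
    n0_h : utilities sampled from the stratum
    n_h  : sampled utilities that completed the survey
    r_h  : sampled utilities that refused or otherwise did not respond
    """
    base_weight = N_h / n0_h           # equation (1)
    adjustment = (n_h + r_h) / n_h     # equation (2), reciprocal of the response rate
    return base_weight * adjustment    # equation (3)

# CWS serving fewer than 501 people (Table A.1): 25,902 in the frame,
# 948 sampled, 206 respondents; the other 742 sampled systems are treated
# here as eligible non-respondents (purely illustrative).
w = final_weight(N_h=25_902, n0_h=948, n_h=206, r_h=948 - 206)
print(round(w, 1))   # each responding utility represents roughly 126 utilities
```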
Variance Estimation

The sampling design affects the standard error of the estimates. The stratification of the utilities by service provided and population served will tend to reduce the overall sample variance, as utilities within each stratum tend to be similar to each other and different from utilities in other strata. The variance is estimated using a first-order Taylor expansion. The variance estimator, which was calculated in Stata for this survey, is given by:

(4)  V(R) = (1/X^2) [ V(Y) - 2R Cov(Y, X) + R^2 V(X) ]

where R = Y/X is the ratio of estimates of two population totals, Y = Σ_h Σ_i Σ_j w_hij y_hij and X = Σ_h Σ_i Σ_j w_hij x_hij, summing over the L strata (indexed by h), the m_h primary sampling units in stratum h (indexed by i), and the n_hi elements in the ith primary sampling unit of stratum h (indexed by j).

Most of the estimates produced are either proportions or means. (The sample was designed to estimate proportions with a 95 percent confidence interval of plus or minus 5 percentage points. It was not designed to estimate means with a given level of precision, but the sample is used to estimate means and totals in some cases.) A mean is simply a ratio in which x_hij is equal to 1. A proportion is simply a mean in which y_hij is equal to a 0/1 variable.[1]

[1] See W.G. Cochran, Sampling Techniques (New York: John Wiley & Sons, 1977), for more information about variance estimates.

A finite population correction factor was applied because the sample was relatively large and was selected without replacement. The factor, f_h, is the ratio of the number of systems in the sample to the number of systems in the stratum, n_h/N_h. To estimate the variance, we first define the following ratio residual:

(5)  d_hij = (1/X) (y_hij - R x_hij)

We then define the weighted total of the ratio residual as:

(6)  z_dhi = Σ_j w_hij d_hij

and the weighted average of the residual as:

(7)  zbar_dh = (1/m_h) Σ_i z_dhi

We can then define the variance estimate as:

(8)  V(R) = Σ_h (1 - f_h) [ m_h / (m_h - 1) ] Σ_i (z_dhi - zbar_dh)^2

where f_h is the finite population correction for stratum h.

We use a logit transform of the proportion to calculate the confidence limits of each proportion. If p is the estimated proportion, the logit transform of the proportion is:

(9)  f(p) = ln( p / (1 - p) )

If s is the estimated standard error of p, derived from equation (8), the standard error of f(p) is given by:

(10)  SE[f(p)] = s / ( p(1 - p) )

The 95 percent confidence interval on the logit scale is thus:

(11)  ln( p / (1 - p) ) ± t_(1-0.025, v) × s / ( p(1 - p) )

where t_(1-0.025, v) is the critical value of the Student's t distribution that cuts off an upper-tail probability of 0.025, with v degrees of freedom. The endpoints are then transformed back to the proportion metric by using the inverse of the logit transform.
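As an illustration of the confidence-interval construction in equations (9) through (11), the sketch below computes a 95 percent interval for a proportion on the logit scale and transforms the endpoints back. The estimated proportion, standard error, and degrees of freedom used in the example are placeholder values; in the survey itself these quantities came from the design-based variance estimator above, computed in Stata.

```python
import math
from scipy import stats

def logit_ci(p: float, se: float, df: int, alpha: float = 0.05) -> tuple[float, float]:
    """95 percent CI for a proportion via the logit transform (equations 9-11)."""
    logit_p = math.log(p / (1 - p))             # equation (9)
    se_logit = se / (p * (1 - p))               # equation (10)
    t = stats.t.ppf(1 - alpha / 2, df)          # critical value with df degrees of freedom
    lo = logit_p - t * se_logit                 # equation (11), lower endpoint
    hi = logit_p + t * se_logit                 # equation (11), upper endpoint
    inv = lambda x: 1 / (1 + math.exp(-x))      # inverse logit (back-transform)
    return inv(lo), inv(hi)

# Placeholder values: an estimated proportion of 0.33 with a design-based
# standard error of 0.02 and 1,900 degrees of freedom (illustrative only).
print(logit_ci(0.33, 0.02, 1_900))   # roughly (0.29, 0.37)
```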
A.4. Survey Design and Response

The survey was administered through an on-line self-administered questionnaire. This section describes the survey instrument, the processes for distributing the questionnaires, the process for ensuring sufficient response rates, and the handling of returned questionnaires.

A.4.1. Questionnaire Design

Following the approval of the Emergency ICR by the Office of Management and Budget on June 6, 2020, an EPA workgroup composed of staff from the Water Security Division and Drinking Water Protection Division of the Office of Ground Water and Drinking Water and from the Office of Wastewater Management developed the questionnaire with the assistance of the Cadmus Group, LLC. The workgroup developed and refined a list of questions for each section, with attention to formatting and phrasing the questions to ensure they were clear and would collect the most useful information. Cadmus programmed the questionnaire in Qualtrics. The topics covered were:

A. Demographics (basic information about the respondent)
B. Supply chain issues
C. Workforce issues
D. Financial issues
E. Analytical support issues
F. Cybersecurity issues
G. General information and concerns about the future

Sections B-G of the survey included free-response questions, allowing respondents to explain how they developed their answers or provide any additional information they considered relevant.

Senior Cadmus water sector experts reviewed the questionnaire for terminology that could potentially confuse utility staff and to estimate the length of time that would be required to complete it. Representatives of several water sector associations reviewed the questionnaire as well. EPA conducted a pre-test of the survey with four utility volunteers in August 2020 to test the emailing of survey links and the online data collection process and to get additional feedback on the content and language of the survey. All four of the utility volunteers provided written input, and two participated in a debriefing conference call as well. EPA made minor changes to the data collection approach and questionnaire as a result of the internal and external review and pre-test. Further minor changes were made to the questionnaire following review by Agency senior management in September 2020.

A.4.2. Survey Administration

EPA and water sector partners conducted outreach in advance of the survey to inform water and wastewater utilities of the data collection effort and explain the purpose of the survey. The survey began with a pilot-test "soft launch" on October 1, 2020, to an initial batch of 100 utilities from the sample, to test the email and data collection systems and the technical support system. The full launch followed on October 6, 2020. A second round of emails was sent on October 29 to increase the total number of respondents. Most responses were received in October and November; some trailing responses were received in December. Data collection ceased on December 31, 2020.

Each survey participant received a unique link, along with instructions for responding, helpdesk information, an "opt out" link, and an invitation to contact EPA staff to verify the legitimacy of the survey. Each survey link was unique to enable survey tracking. EPA established a process to provide technical assistance and guidance to the survey respondents via a contractor-staffed helpdesk and a dedicated EPA email address. Follow-up email and phone contact was made with survey recipients to encourage them to respond. In some cases, state staff and other stakeholders also helped encourage survey participants to complete the survey. Helpdesk staff followed up periodically with each utility in the sample until the utility either responded to the survey or declined to participate. The helpdesk staff provided technical assistance as necessary, and in some cases helped those having technical trouble to fill out the questionnaire. During review of data logged in the Qualtrics system, if a problem or question arose, Cadmus staff contacted the utility to resolve it.

A.4.3. Survey Response and Evaluation of Possible Non-Response Bias

The data collection effort was closed out December 31, 2020. Of the 6,481 utilities sampled, 1,956 responded to the survey. The overall response rate was 30.2 percent.

Non-response may introduce a source of bias into estimates based on the survey if non-respondents differ from respondents in some relevant way. The potential effect of non-response bias on estimates of the issues facing utilities can be positive or negative. On the one hand, systems that are struggling due to the pandemic may be unable to respond to the survey. On the other hand, systems not reporting significant problems may not see the need to participate. Therefore, both the sign (whether the bias is positive or negative) and the magnitude of non-response bias are uncertain.

To evaluate the potential effect of the non-response bias, in mid-December of 2020 EPA contacted 37 utilities that were sampled but chose not to participate in the survey. EPA asked the utilities why they did not participate and whether they expected the pandemic to affect their operations in terms of supply chain, workforce, finances, analytic sampling, and cybersecurity. Of the 37 utilities successfully contacted, 10 reported that the COVID-19 pandemic was not an issue or concern. The remaining utilities indicated that they had declined to participate in the survey due to lack of time or resources. Thirty of the 37 utilities provided information about the potential effects of the pandemic. Their responses are shown in Table A.5.

Table A.5. Potential Effects of COVID-19 on Utilities that Did Not Respond to the Survey

Issue or Concern | Utilities Reporting that COVID-19 Does Not Raise this Concern | Utilities Reporting that COVID-19 Raises this Concern | Total
Supply Chain | 22 (73%) | 8 (27%) | 30 (100%)
Workforce | 19 (63%) | 11 (37%) | 30 (100%)
Finance | 21 (70%) | 9 (30%) | 30 (100%)
Analytic Support | 29 (97%) | 1 (3%) | 30 (100%)
Cybersecurity | 28 (93%) | 2 (7%) | 30 (100%)
Other | 25 (83%) | 5 (17%) | 30 (100%)
Any Issue | 15 (50%) | 15 (50%) | 30 (100%)

In general, the percentages of non-respondents reporting concerns are comparable to the survey findings.
Based on the responses to the survey, roughly one-third of utilities had a supply chain concern during 2020, compared to 27 percent among non-respondents. Approximately one-quarter of utilities faced workforce shortages, and 37 percent of non-respondents reported workforce issues. Approximately one-third of utilities had lower net revenue than they budgeted, which is consistent with the 30 percent of non-respondents that reported financial concerns. Approximately one-tenth of utilities had conditions that interfered with their ability to complete sampling and one-tenth had issues with laboratory analysis, compared to the 3 percent of non-respondents that had concerns in these areas. The contacted non-respondents were more likely to report cybersecurity concerns (7 percent) than the 1 percent of utilities that experienced cybersecurity issues in 2020. On the other hand, the survey findings show that approximately one-quarter of the utilities have concerns about cybersecurity in the future.

A.5. Quality Assurance

The survey was conducted in accordance with an approved Programmatic Quality Assurance Project Plan. Because of the short timeline of the emergency information collection, a Supplemental Quality Assurance Project Plan (SQAPP) was not developed specifically for this survey; EPA followed the guidelines and procedures established in the DWINSA QAPP. EPA enacted specific measures to check and ensure the validity of the survey data from data collection through data processing and analysis, as well as measures to assure the quality of other survey components.

A.5.1. Draft Questionnaire Pre-Test and Survey Pilot Test

A significant component of the survey quality assurance plan was to thoroughly test the questionnaire design, the survey design, and the data collection procedures prior to implementing the full study. Efforts to confirm the validity and effectiveness of these designs, and to revise them when the tests revealed problems, errors, or difficulties, led to design and process improvements in such areas as data reliability, data completeness, accuracy of the sample frame, and response rates.

Pre-test

When EPA had identified the initial data collection objectives and completed a working draft of the questionnaire, EPA shared the draft with two groups of water sector experts: three contractor staff members and representatives of several water association partners and stakeholders. Improvements were made to the survey instrument based on the feedback received, and Cadmus staff provided an estimate of the time it would take utilities to complete the survey.

EPA then conducted a formal pre-test of the draft survey with four utility volunteers. The main objectives of the pre-test were to evaluate the clarity of the questions and to test some of the Qualtrics system's functionality for emailing links. The four utility volunteers reviewed the questionnaire in August 2020. Each provided written feedback, responding to a set of questions. The volunteers explored questions regarding comprehensibility, use of clear and appropriate terminology, provision of suitable response categories, and questionnaire layout. The reviewers also evaluated the ease or difficulty of providing answers, their immediate knowledge of or access to information requested by the questionnaire, and their overall reaction to the survey. Two of the volunteers not only provided written feedback but also participated in a debrief via conference call.
Overall, the volunteers thought the questionnaire was clear and relatively easy to follow. As a result of the pre-test, some questions were re-worded. Otherwise, the pre-test found no systematic problems in respondents' ability to provide answers to the questions.

Pilot Test

The survey began with a "soft launch," which served as a full-scale pilot test of the survey instrument and data collection procedures. The test was conducted in October 2020. The soft launch provided an opportunity to test survey distribution, survey support, data systems, and logging of survey results on a limited number of participants before the full launch. One hundred utilities from the survey sample (40 CWSs, 40 WWTFs, and 20 AI utilities) were selected from the full sample for use in the pilot. Based on the experience of the soft launch, modest changes were made to the emails and the instructions for utilities for the full launch.

A.5.2. Sampling Quality Assurance

Quality assurance of the sampling process for the survey involved three principal areas:

1. Development of the sample frame
2. Sampling specifications
3. Use of software designed to draw samples

Development of the Sample Frame

EPA conducted an extensive review of the data used for the sample frame. For CWSs, the survey was able to take advantage of the extensive data verification effort undertaken for the 7th DWINSA. The DWINSA frame was developed with SDWIS data from the second quarter of 2019. State representatives working on the DWINSA reviewed their respective lists of systems from the data freeze and made changes to population and source categories as needed. The sample frame was then built using the data from the states.

The 2012 CWNS was the basis for the WWTF sampling frame. For the COVID-19 survey, EPA supplemented the CWNS data with information from ECHO to fill in gaps.

The sampling frame for the AI and ANV utilities used the inventory of utilities developed in 2016-2018 for IHS evaluations of the operation and maintenance costs of tribally owned and operated American Indian and Alaska Native Village drinking water and wastewater utilities. The IHS inventory was based on the list of systems and wastewater facilities in SDWIS and in the IHS Operation and Maintenance Data System (OMDS). For the IHS project, the AI utility inventory was reviewed and verified by federal EPA and IHS staff, and the ANV utility inventory was reviewed and verified by EPA Region 10 and the state of Alaska. The COVID-19 survey inventory of AI and ANV utilities was reviewed and verified by the federal and state staff.

Sampling Specifications

To carry out the sampling processes, the survey statisticians prepared written sampling plans that served as directions for performing the sampling and as permanent documentation of the process. The specifications ensured the sample was drawn in conformity with the sample design and in a statistically valid manner.

Sampling Software

The sample of utilities was drawn using a Stata program designed to draw stratified random samples. EPA developed programs to draw the samples to ensure they could be replicated.
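The sketch below illustrates the general idea of a replicable stratified draw: fix a random seed, then select a simple random sample of the required size within each stratum. It is a simplified stand-in written in Python rather than Stata; the data-frame columns, the miniature example frame, and the seed value are assumptions for illustration, not EPA's actual sampling program.

```python
import pandas as pd

def draw_stratified_sample(frame: pd.DataFrame, sizes: dict[str, int],
                           seed: int = 20200701) -> pd.DataFrame:
    """Draw a simple random sample within each stratum of the frame.

    frame : one row per utility, with a 'stratum' column
    sizes : required sample size for each stratum
    seed  : fixed so the draw can be replicated exactly
    """
    draws = [
        group.sample(n=min(sizes[name], len(group)), random_state=seed)
        for name, group in frame.groupby("stratum")
    ]
    return pd.concat(draws, ignore_index=True)

# Hypothetical miniature frame, for illustration only.
frame = pd.DataFrame({
    "utility_id": range(1, 11),
    "stratum": ["CWS: 500 or fewer"] * 6 + ["WWTF: 9,999 or fewer"] * 4,
})
sample = draw_stratified_sample(frame, {"CWS: 500 or fewer": 3,
                                        "WWTF: 9,999 or fewer": 2})
print(sample)
```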
A.5.3. Data Collection Quality Assurance

Questionnaire Design

The various drafts of the questionnaires were the product of close review and comment by EPA, its contractor, internal reviewers, and external stakeholder reviewers. Improvements also were made as a result of the pre-test and pilot test. Questionnaire version control was maintained through the various drafts by allowing for one master copy and strictly enforcing version-control procedures. After changes were made to each version of the questionnaire, a new electronic file with the date of the changes was created. Cadmus was responsible for making all changes and could track changes over previous versions of the questionnaire. The questionnaire form was iteratively reviewed and improved to make it clear and simple to use. The on-line survey design incorporated skip patterns so respondents would answer only the appropriate questions.

Email Data Collection

Analysts with extensive experience administering on-line surveys were responsible for the mailout. They were provided with clear specifications by a senior staff member. The Qualtrics system permitted the introductory email to be tailored to each utility, including providing each participant with a unique survey link. Each utility was assigned to one of 12 helpdesks, and each helpdesk was staffed by an analyst who maintained contact with the helpdesk's utilities throughout the survey. The analysts provided reminder calls and emails and technical support to the survey participants. The work of the analysts was overseen by a back-up team and senior survey managers, and when necessary the team elevated questions to EPA.

When survey support staff learned that a survey had been received by someone other than the intended utility recipient, they thanked the informant and instructed them not to complete the survey, and they made phone calls or conducted Internet research to find better contact information. The revised contact information was then entered into the project's data files. The affected surveys were closed down, and fresh links were emailed to the newly acquired email addresses.

The tracking system ensured proper tracking and control of all questionnaires from the initial emailing of links until the data were formally submitted. Periodic status reports from the Qualtrics system supported overall management of the project and also helped to identify response rate problem areas, which enabled EPA to take appropriate follow-up measures.

A.5.4. Data Processing Quality Assurance

Qualtrics Data Extraction

The completed survey results were extracted from the Qualtrics database and reviewed. Entries that needed to be removed (because they were from "test" surveys, because they were duplicate entries from a single utility, or because someone began filling out a survey before realizing it was received in error) were struck from the data set. The electronic data were then transformed into a hierarchical database for detailed analysis.
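A simplified sketch of this kind of extraction cleanup is shown below: drop records flagged as tests, drop entries received in error, and keep a single record per utility when duplicates exist. The column names and the rule of keeping the most recently submitted duplicate are assumptions for illustration, not a description of the actual extraction procedures.

```python
import pandas as pd

def clean_extract(raw: pd.DataFrame) -> pd.DataFrame:
    """Remove test, erroneous, and duplicate records from a raw survey export.

    Assumed columns: 'utility_id', 'is_test', 'received_in_error', 'submitted_at'.
    """
    cleaned = raw[~raw["is_test"] & ~raw["received_in_error"]]
    # Keep one record per utility; here the latest submission is assumed to win.
    cleaned = (cleaned.sort_values("submitted_at")
                      .drop_duplicates(subset="utility_id", keep="last"))
    return cleaned.reset_index(drop=True)
```

In practice, as noted above, questionable duplicates were resolved by contacting the utility rather than by a fixed rule.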
Procedures were in place to maintain the integrity and quality of the data, as described below.

Automated Data Validation Checks

In preparing the final database, EPA ran a series of computer validation checks. These checks were run on the full survey database after the data were entered and had passed the standard computer edits for values and ranges on a variable-by-variable basis. The checks included the following:

1. Distribution frequencies for all categorical variables. The vast majority of the responses were categorical: either yes/no responses or choices from a short list of possible responses.

2. Distribution frequencies for all the continuous numerical variables (budgeted and actual revenue and expenses), formatted into four categories: non-zero responses, zero responses, legitimately skipped, and missing.

3. Univariate statistics for each continuous variable.

For the financial data reported in section D of the survey, EPA reviewed the distribution of responses to identify potential outliers and survey response errors. For example, by exploring the distribution of responses by stratum, analysts were able to identify cases where amounts were entered in millions of dollars rather than dollars. EPA made the corrections in the version of the data used in the analysis; EPA did not change the responses in the original dataset.
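To illustrate the kind of automated check described above for the financial variables, the sketch below flags responses that are extreme relative to other utilities in the same stratum, which is one way values entered in millions of dollars rather than dollars might surface for manual review. The column names and threshold are assumptions for illustration; EPA's actual review relied on distributional summaries by stratum and analyst judgment.

```python
import pandas as pd

def flag_financial_outliers(df: pd.DataFrame, value_col: str,
                            ratio_threshold: float = 100.0) -> pd.DataFrame:
    """Flag values far above the stratum median (possible unit errors).

    Assumed columns: 'stratum' and the numeric column named by value_col.
    A response more than `ratio_threshold` times its stratum median is flagged
    for manual review; nothing is changed automatically.
    """
    medians = df.groupby("stratum")[value_col].transform("median")
    flagged = df.copy()
    flagged["possible_unit_error"] = df[value_col] > ratio_threshold * medians
    return flagged
```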
Database Quality Assurance

The final, clean survey database represented the product of the various review, editing, data entry, and data validation steps described above. Once the database was prepared, a number of subsequent data processing steps were required to create the set of files used in the analyses and tabulations for the report. The principal steps included:

1. Appending needed variables from external files, including sample and contact information from the sampling frame.
2. Analyzing the questionnaires and the frequency distributions of continuous and categorical variables to devise rules for handling missing data.
3. Zero-filling blank responses. A detailed series of rules was developed for assessing blank responses and determining whether to regard them as zeros or missing values. A detailed set of programming specifications was designed to implement these rules.
4. Creating new derived variables from the survey data to categorize systems into strata comparable to the original sampling strata but based on the final survey responses about the size of the population served, rather than the sampling frame's data.
5. Recoding "Other" responses on multiple-choice questions in cases where the written-in response was one of the multiple-choice options already available.
6. Reviewing free-response answers to see whether any information provided by respondents about how they answered specific questions would require recoding.
7. Attaching the sample weights and finite population correction to the analytical file.

Each step was planned in advance. Early drafts of tables summarizing the survey responses were provided to EPA for review. The results were independently reviewed and verified.

Version control was maintained for all custom computer code that EPA developed, and interim stages of all data files were archived. This meant that when changes were made to a program or process, it was clear which version was current, and the sequential changes that had been made from one version to the next were apparent. It was always possible to restore any earlier version in full or to merge selected data from the old version into the new version. The combination of the processing specifications, version control, and data archiving ensured that no process was irreversible, that it was always possible to recover from any deliberate or inadvertent changes to the data, and that the characteristics of the survey data were fully known at each processing stage.

Tabulation Quality Assurance

EPA produced the detailed summaries of each question by stratum (see Appendix C). The following steps were taken to help ensure that each table accurately summarized and presented the data contained in the final survey database:

1) Identify important, relevant, and useful information that could be developed from analyses of the survey data.
2) Design each table to effectively present the results.
3) Clearly describe the contents of each table.
4) Define in detail the variables, values, formulas, and derivations that go into each calculation.
5) Prepare clear and detailed data processing specifications for carrying out the tabulations according to the calculation definitions.
6) Develop computer programs to process the data pursuant to the tabulation specifications.
7) Review the initial output for:
   a) Consistency with the design of the table
   b) Conformity with the definitional and programming specifications
   c) Reasonable agreement with expected values, based on external measures and expert knowledge
8) Review definitions, specifications, programs, and underlying data for tabulations exhibiting data anomalies or outliers.
9) Revise definitions, specifications, or programs if the review process identifies errors or the need for modifications to previous decisions.
10) Repeat the previous tabulation quality assurance steps and re-run the tabulations until no further unacceptable data anomalies are found.

The tabulation process was fully automated, from the underlying source data through all processing stages to the final formatted tables. There were no intermediate stages requiring manual transfer or entry of data from one stage to the next. This eliminated human transcription error. Of equal importance, it also expedited the process of successive iterations of the tabulations during the quality review process: each time a table was produced, the output data were automatically transferred into the same final table form as in the previous iteration. This ensured that any new anomalies identified in later iterations did not result from transcription errors and allowed the review staff to focus their investigations on the table data, specifications, and programs.

In preparing the data tables for Appendix C, EPA redacted fields that contain information that could be used to identify utility participants. These included fields with utility names and identification numbers, fields with contact information, fields generated by Qualtrics with geolocation data, and all free-text response fields. Fields with high-level geographic information, such as state or territory, were retained. Before redacting the fields with free-response answers, EPA reviewed them for illustrative quotations to use in the summary report.

A.5.5. Quality Assurance During Report Preparation

The findings presented in the report are based on the data tables presented in Appendix C and other Stata outputs. The analyses were conducted in the statistical package Stata using a series of programs (called "do files"). These programs were reviewed by at least two analysts, and all changes were tracked and documented. Decisions to exclude outliers or other data from analyses were documented. Findings presented in the report, including quotations from free-response answers, were reviewed by Cadmus and EPA staff, and revisions to the report were tracked.