EPA/601/R-14/006 I March 2015 I www.epa.gov/hfstudy United States Environmental Protection Agency Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0: Data Management and Quality Assessment Report United States Environmental Protection Agency Office of Research and Development ------- Data Management and Quality Assessment Report March 2015 [This page intentionally left blank.] ------- Data Management and Quality Assessment Report March 2015 Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0: Data Management and Quality Assessment Report U.S. Environmental Protection Agency Office of Research and Development Washington, DC March 2015 EPA/601/R-14/006 in ------- Data Management and Quality Assessment Report March 2015 Disclaimer This document has been reviewed in accordance with U.S. Environmental Protection Agency policy and approved for publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use. Preferred Citation: U.S. Environmental Protection Agency. 2015. Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0: Data Management and Quality Assessment Report. Office of Research and Development, Washington, DC. EPA/601/R-14/006. IV ------- Data Management and Quality Assessment Report March 2015 Table of Contents Disclaimer iv Table of Contents v List of Tables vi List of Figures vi Preface vii Acknowledgements viii List of Acronyms ix 1. Introduction 1 2. Source Data 1 3. Database Development 1 3.1. Downloading and Conversion 2 3.2. Extraction and Parsing 3 3.3. Output Data Structure 4 4. Assignment of Hydrocarbon Regions to Disclosures 7 5. Quality Assurance Process for Locational Data 10 6. Chemical Name Standardization 12 7. Data Field Descriptions 13 7.1. Data Fields in Main Tables 13 7.1.1. Well Header Field Descriptions 13 7.1.2. Ingredient Field Descriptions 20 7.2. Data Fields in Tables Associated with Standardizations 22 7.2.1. Chemical Name Standardization 22 7.2.2. Operator Standardization Information 22 7.2.3. Trade Name Standardization 23 7.2.4. Ingredient Purpose Standardization 23 7.3. Data Fields in Other Tables 24 7.3.1. Proppant Identification 25 7.3.2. Resin Coating Identification 25 7.3.3. CBI Identification 25 7.3.4. Water Source Identification 25 7.3.5. Purpose Categorization 26 7.3.6. State Regulation Information 26 ------- Data Management and Quality Assessment Report March 2015 7.3.7. County Information 27 7.3.8. Water Synonyms 27 7.3.9. Unparsed PDFs 27 8. Summary 28 References 29 List of Tables Table 1. Summary of parsing success 4 List of Figures Figure 1. Example FracFocus 1.0 disclosure 2 VI ------- Data Management and Quality Assessment Report March 2015 Preface The U.S. Environmental Protection Agency (EPA) is conducting a Study of the Potential Impacts of Hydraulic Fracturing for Oil and Gas on Drinking Water Resources. The study is based upon an extensive review of the literature; results from EPA research projects; and technical input from state, industry, and non-governmental organizations, as well as the public and other stakeholders. A series of technical roundtables and in-depth technical workshops were held to help address specific research questions and to inform the work of the study. In Fiscal Year 2010, Congress urged the EPA to examine the relationship between hydraulic fracturing and drinking water resources in the United States. The EPA's Plan to Study the Potential Impacts of Hydraulic Fracturing on Drinking Water Resources was reviewed by the agency's Science Advisory Board (SAB) and issued in 2011. The Study of the Potential Impacts of Hydraulic Fracturing on Drinking Water Resources: Progress Report, detailing the EPA's research approaches and next steps, was released in late 2012 and followed by a consultation with individual experts convened under the auspices of the SAB. This report, Evaluation of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0: Data Management and Quality Assessment Report, is the product of one of the research projects conducted as part of the EPA's study. It has undergone independent, external peer review, which was conducted through the Eastern Research Group, Inc. All peer review comments were considered in the report's development. The report has also been reviewed in accordance with agency policy and approved for publication. The EPA is writing a state-of-the-science assessment that integrates a broad review of existing literature, results from peer-reviewed EPA research products (including this report), and information gathered through stakeholder engagement efforts to answer the fundamental research questions posed for each stage of the hydraulic fracturing water cycle: • Water Acquisition: What are the possible impacts of large volume water withdrawals from ground and surface waters on drinking water resources? • Chemical Mixing: What are the possible impacts of surface spills on or near well pads of hydraulic fracturing fluids on drinking water resources? • Well Injection: What are the possible impacts of the injection and fracturing process on drinking water resources? • Flowback and Produced Water: What are the possible impacts of surface spills on or near well pads of flowback and produced water on drinking water resources? • Wastewater Treatment and Waste Disposal: What are the possible impacts of inadequate treatment of hydraulic fracturing wastewaters on drinking water resources? The state-of-the-science assessment is not a human health or an exposure assessment, nor is it designed to evaluate policy options or best management practices. As a Highly Influential Scientific Assessment, the draft assessment report will undergo public comment and a meaningful and timely peer review by the SAB to ensure all information is high quality. vn ------- Data Management and Quality Assessment Report March 2015 Acknowledgements The EPA would like to acknowledge the Ground Water Protection Council and the Interstate Oil and Gas Compact Commission for providing data and information for this report. Assistance was provided by The Cadmus Group, Inc., under contract EP-C-08-015. The contractor's role did not include establishing agency policy. Vlll ------- Data Management and Quality Assessment Report March 2015 List of Acronyms API American Petroleum Institute CASRN Chemical Abstracts Service Registry Number CBI Confidential Business Information CSV Comma-Separated Values EIA U.S. Energy Information Administration EPA U.S. Environmental Protection Agency FIPS Federal Information Processing Standards GIS Geographic Information System GWPC Ground Water Protection Council ID Identification IOGCC Interstate Oil and Gas Compact Commission NAD North American Datum PDF Portable Document Format QA Quality Assurance TVD True Vertical Depth USGS U.S. Geologic Survey WGS World Geodetic System XML Extensible Markup Language IX ------- Data Management and Quality Assessment Report March 2015 [This page intentionally left blank.] ------- Data Management and Quality Assessment Report March 2015 1. Introduction This report describes the procedures used to develop a database from data submitted to the FracFocus Chemical Disclosure Registry (subsequently referred to as "FracFocus") by well operators. The resulting project database was used to conduct the analyses described in the Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0 (subsequently referred to as the "data analysis report;" US EPA, 2015).: This data management report can be used in conjunction with the project database and data analysis report to reproduce the results presented in the data analysis report and to conduct additional analyses, if desired. 2. Source Data FracFocus is a publicly accessible website (www.fracfocus.org) managed by the Ground Water Protection Council (GWPC) and the Interstate Oil and Compact Commission (IOGCC) where oil and gas production well operators can disclose information about the composition of hydraulic fracturing fluids at individual wells.2 Disclosures included in the project database were submitted to FracFocus by well operators using the FracFocus 1.0 format and were provided in portable document format (PDF) to the U.S. Environmental Protection Agency (EPA) by the GWPC in March 2013.3 The PDF files were converted to Extensible Markup Language (XML) and parsed into a Microsoft Access database (Microsoft Corporation, 2012). Reviews of data quality were conducted on the project database to ensure that the results from analyses of the project database reflect the data contained in the original PDF disclosures, while identifying obviously invalid or incorrect data to exclude from analyses. The source data provided by the GWPC were a bulk archive of 39,136 disclosures in PDF format that were submitted to the FracFocus 1.0 website prior to March 1, 2013. Each disclosure was initially submitted by the well operator to FracFocus in the form of a Microsoft Excel spreadsheet and contained information on one production well that was hydraulically fractured with a single fracture date. Each Excel spreadsheet was then converted into a PDF file by the FracFocus website. 3. Database Development The initial development of the project database involved data conversion of disclosures from PDF format to XML files, parsing to extract information, and incorporation of the resulting data into a 1 The project database and the data analysis report are available at http://www2.epa.gov/hfstudy/published-scientific- papers. 2 Prior to February 28,2011, six of the 20 states with data in the project database began requiring operators to disclose chemicals used in hydraulic fracturing fluids to FracFocus (Colorado, North Dakota, Oklahoma, Pennsylvania, Texas, and Utah). Three other states started requiring disclosure to either FracFocus or the state (Louisiana, Montana, and Ohio), and five states required or began requiring disclosure to the state (Arkansas, Michigan, New Mexico, West Virginia, and Wyoming). Alabama, Alaska, California, Kansas, Mississippi, and Virginia did not have reporting requirements during the period of time studied in the data analysis report. Between February 5,2011, and April 13,2012, Pennsylvania required reporting to the state. As of April 14,2012, Pennsylvania required reporting to both the state and FracFocus. 3 FracFocus 2.0 became the exclusive disclosure mechanism in June 2013. More information on the FracFocus 1.0 FracFocus 2.0 formats may be found in the FracFocus 2.0 Operator Training materials available at http://fracfocus.org/node/331. ------- Data Management and Quality Assessment Report March 2015 Microsoft Access database. The subsequent steps to conduct quality assurance (QA) and the resulting tables and fields that are suitable for data analysis are described in Sections 4, 5, 6, and 7. In describing the database development in this report, underline formatting denotes table names, bold formatting denotes field names, and italic formatting denotes data values. 3.1. Downloading and Conversion The GWPC prepared a complete archive of all FracFocus 1.0 PDF disclosures (files) uploaded through February 28, 2013, and transferred the archive to the EPA. Adobe Acrobat Pro X (Adobe Systems Incorporated, 2011) was then used to convert all 39,136 PDF files in the archive to XML 2003 spreadsheet (Microsoft Excel 2003 XML) files. The conversion was performed because it is inherently difficult to extract data from PDF files, which are intended to provide consistent visual presentation across devices rather than structured representation of data for parsing and extraction. Tables of information in PDF files, in particular, can present a challenge for conversion. The source Microsoft Excel files, as uploaded by the operators, contained data in tables. However, in a PDF file, a table is essentially a series of lines and characters positioned on a page that, when assembled by PDF-reading software, appear as a table to the end user. To obtain tabular information from a PDF file, the PDF was converted to XML file format, which allows discrete data to be sorted into specific fields so that the data can be manipulated during analysis. Each FracFocus 1.0 disclosure contains two tables of information. Figure 1 shows an example of an individual well disclosure available to the public as a PDF. At the top of each disclosure is the well header table (outlined in blue in Figure 1), which contains the fracture date, well identifiers [i.e., Hydraulic Fracturing Fluid Product Component Information Disclosure Fracture Date: State: County: API Number: Well Name and Number: Longitude: Latitude: Long/Lat Projection: Production Type True Vertical Depth (TVD): Total Water Volume (gal): Hydraulic Fracturing Fluid Compositi Trade Name Water Sand Hydrochloric Acid Aceticplex 50 Plexgel 907L-EB Plexaid 430 Buffer 12 Plexgel Breaker HT Plexcide 24L Supplier Company A Company B Company B Company C Company A Company D Company B Company B Company C 1/10/2011 Texas Greer 99-123-45678 Company ABC Well XYZ -94.611274 27.035098 NAD27 Oil 14,637 3 107.561 Purpose Carrier/Base Fluid Proppant Acid Petrochemical industry: Oil Well Acidizing, Iron Seauesteranl Viscosifier for water Gel stabilizer pH buffer Encapsulated Oxidizing gel breaker Biocide Well Header Table ^ — ^^ Ingredients Table Ingredients Water Crystalline Silica Hydrogen Chloride Acetic Acid Distillate, petroleum, hydroireated light Propylene Pentamer C-11 to C-14 n-alkanes, mixed Sodium Thiosulfate Potassium Hydroxide Ammonium Persulfate Tetrahydro-3, 5-Dimethyl-2H-1 ,3, 5-Thiadiazine-2-Thione Sodium Hydroxide Chemical Abstract Service Number (CAS #) 7732-18-5 1 4808-60-7 7647-01 -0 64-1 9-7 64742-47-8 15220-87-8 Mixture 7772-9S-7 1 31 0-58-3 7727-54-0 533-74-4 1 31 0-73-2 Maximum Ingredient Concentration in Additive (by mass)** 100.00 100.00 40.00 50.00 6O.OO 60.00 60.00 30.00 23.00 90.00 24.00 4.00 Maximum Ingredient Concentration in HF Fluid (by mass)** 84 O9743 12.32189 1.09518 0.01187 O.21713 0.21713 0.21713 0.02214 0.04030 0.00144 0.01131 0.00189 Comments Figure 1. Example FracFocus 1.0 disclosure. ------- Data Management and Quality Assessment Report March 2015 American Petroleum Institute (API) number and well name], locational data, production type, true vertical depth (TVD) of the well, and the total water volume used to hydraulically fracture the well.4 The ingredients table (outlined in red in Figure 1) provides information on the trade names of the additives used in the hydraulic fracturing fluids, the supplier, and additive purpose. Each additive contains one or more ingredients, and the ingredients table includes the chemical name and Chemical Abstracts Service Registry Number (CASRN) for each ingredient, as well as the maximum concentrations as a percentage by mass in the additive and in the hydraulic fracturing fluid. 3.2. Extraction and Parsing A script was used to read the XML files, parse the relevant data, and compile those data into a useable format The parsing script was written in Python 2.7 (Python Software Foundation, 2012) and uses the Beautiful Soup 4 library (Richardson, 2013) to read the XML files. The script first locates and extracts the well header information for a given file. Generally, the fracture date appears first in a PDF, followed by other parameters in order. The script locates the first cell in the file that is of cell type "DateTime."5 The script then reads the columns below the date with the assumption that the other well header fields are ordered as anticipated from the disclosure template provided to well operators. In some cases, text wrapping in the original PDFs will split values into multiple rows, resulting in extra header cells. To address this, the position of the longitude field, which is always a negative number for locations within the United States, is used as a "landmark" to recalibrate the ordering of data fields. The script parses information from the ingredients table by locating individual columns of information and then reading cells in that column until the bottom of the table is reached. The bottom of the table is either the last row with more than one cell or the last row in the sheet Columns are located by searching for text patterns that indicate the presence of a column header. In developing the script, the text patterns were refined based on experience; some operators represent the same column of information differently. For the data fields Purpose and Trade Name in the ingredients table of the disclosure (Figure 1), operators generally enter a value once to indicate that an additive trade name or purpose applies to all ingredients that follow (e.g., additive "Plexgel 907L-EB" in ingredients table of Figure 1). Thus, a purpose and trade name are applied to ingredients until a new trade name and purpose are encountered. Blank values in the purpose and trade name columns are replaced with the previous value as the column is parsed. The parsing approach is highly sensitive to formatting. If an operator departed from the FracFocus 1.0 template when originally creating a disclosure, the disclosure may have been skipped or information from the disclosure may have parsed incorrectly. Most of the disclosures were prepared in a consistent format that enabled relatively easy parsing of data. However, some disclosures were uploaded using templates modified by the operators, with columns added or 4 More information on the field descriptions may be found in Section 7.1.1. 5Adobe Acrobat identified apparent dates and standardized them automatically. The standardization in this dataset was later reversed, because Acrobat occasionally "standardized" non-date values. ------- Data Management and Quality Assessment Report March 2015 removed, fields left blank, or invalid data entered. The modified disclosures were problematic during parsing and QA. Table 1. Summary of parsing success. Well header parsed Yes or No Yes Yes Yes No Ingredient table parsed Yes or No Yes or No Yes No No Number of disclosures 39,136 38,530 37,017 1,513 606 Percentage of disclosures 100% 98.5% 94.6% 3.87% 1.55% Note: "Yes" and "No" indicate whether portions of the disclosures (well header or ingredient table) were successfully parsed. "Yes or no" indicates that the disclosure counts include disclosures that were parsed and those that were not. As shown in Table 1, the well header table was successfully parsed from 98.5% of disclosures (38,530 of 39,136), and both the well header and ingredient tables were successfully parsed from 94.6% of disclosures (37,017 of 39,136). 3.3. Output Data Structure The script parsed the resulting data into two comma-separated value (CSV) files that form the foundation of the project database. One file contains the well operator, well identifiers, production, and locational data from the well header; the other file contains the additive, additive purpose, chemical, and chemical concentration data from the ingredients table. The two-table structure was considered appropriate because a one-to-many relationship exists between the well header values for an individual disclosure and the multiple values from the ingredients table that correspond to that disclosure. The two tables are linked in the project database by a constructed unique identification (ID) field. The ID field is necessary because the combinations of API Well Number and Fracture Date for 228 disclosures were found to be duplicated in the dataset and, thus, cannot serve as unique identifiers. Unique disclosures—defined by the combination of API Well Number and Fracture Date—were selected from duplicate disclosures by choosing the file with the most recent modification date. The modification date associated with each PDF is not information found on the publicly available disclosure that may be downloaded from FracFocus. If two or more records shared the same values for API Well Number and Fracture Date, then the PDF file with the most recent modification date was flagged as the authoritative disclosure. To maximize the transparency of the QA effort, the final database contains two versions of the data extracted from the FracFocus 1.0 disclosures. The first version contains data as originally parsed without any formatting, spelling corrections, or standardization—these tables are denoted with the "Original" prefix in their names. The values in these tables were taken directly from the CSV files produced by the parsing script and are stored verbatim as text The second version contains data ------- Data Management and Quality Assessment Report March 2015 after formatting, corrections, and standardization were performed—these tables are denoted with the "Qa" prefix. The "Qa" tables also contain fields describing the adjustments made to each disclosure and whether the values met QA criteria. The two-version structure enabled straightforward review of all changes and streamlined tracing of disclosures back to the source data. The primary tables in the project database are as follows: • OriginalWell. Well header data with verbatim (unadjusted) values as parsed to input data. • QaWell. Well header data with minor adjustments applied, including fixed typographical errors, removal of extraneous characters, and corrections of obvious transpositions (e.g., latitude and longitude swapped, state and county swapped). Columns accompanying each set of well header values, also referred to as QA flag fields, describe adjustments made to the OriginalWell data and whether the data met QA criteria as included in the QaWell table. • Originallngredient. Ingredients data with verbatim (unadjusted) values as parsed to input data. • Oalngredient. Ingredients data with minor adjustments applied, including corrected formatting of CASRNs and standardized suppliers. Similar to the table QaWell. the Qalngredient table includes QA flag fields that describe the adjustments made and whether the data met QA criteria for inclusion in analyses. Additional tables in the database supporting the QA efforts and data analyses include the following: • IngredientNameStandardization. Ingredient names were standardized using a list of chemical names paired with CASRNs compiled by the EPA. These standardized names are used in the Qalngredient table. • PurposeStandardization. Additive purpose names were standardized and applied to the Oalngredient table to correct for spelling capitalization, spaces, and punctuation for most purpose entries. Synonyms for proppants and base fluids are also identified in this table. • PurposeCategorization. Categorization of related additive purposes was applied to the standardized purposes for ease of summarizing the data during analyses. Information from this table was used for queries in which summary information was compiled regarding additive purposes. • TradeNameStandardization. Standardized additive trade names were applied to values in the TradeName field to correct for spelling, capitalization, spaces, and punctuation and are used in the Qalngredient table. • OperatorStandardization. Standardized operator names were applied to values in the Operator data to consolidate different representations of operator names and are used in the Qalngredient table. • StateRegulation. This table lists effective dates for state laws that either mandate disclosure of hydraulic fracturing chemicals to FracFocus, allow FracFocus as an alternative to ------- Data Management and Quality Assessment Report March 2015 reporting to state agencies, or require reporting to state agencies. (This information was obtained through separate research and is not information reported by operators to FracFocus.) • Counties. This table provides a listing of all counties in the United States by state, name, and Federal Information Processing Standards (FIPS) code. This table also includes a separate identifier for the five case study counties included in the data analysis report. • CBISynonym. A list was compiled of terms interpreted to indicate confidential business information (CBI) in the Chemical Name and Cas fields of ingredient records. This table was used for analyses of ingredient data reported as CBI or an associated term (such as 'proprietary/ 'trade secret/ etc.). • Proppants. This table provides a listing of solid materials associated with proppant-related additive purposes and indicates whether these materials should be excluded from additive ingredient analyses conducted for the data analysis report.6 The table is not associated with any changes or standardizations in the Qalngredient table, but was referenced in queries for chemicals. • ResinCoating. This list contains ingredients associated with proppant-related additive purposes; these are ingredients that are not minerals, but rather chemicals associated with resin coatings on proppants. The list was referenced in queries for the proppants and additive ingredients analyses discussed in the data analysis report and is not associated with any changes or standardizations in the Qalngredient table. • WaterSourceTerm. This list of terms is interpreted to indicate water sources reported by operators in the TradeName and Comments fields that are included in the Qalngredient table. These terms were used for the water source analysis described in the data analysis report. • UnparsedPDFs. This table lists the PDFs that were unable to be parsed. It is incorporated for transparency and reference. • WaterSynonyms. This list contains variations of operator entries (e.g., in the TradeName, Comments, or ChemicalName fields in Qalngredient) that indicate water but no other descriptors for the water source for base fluids. This list was used in querying for water sources. An ingredient record could match a term on this list only if it did not already match a term in WaterSourceTerm. Section 7 describes the specific data fields found in these tables. Sections 4, 5, and 6, respectively, discuss the incorporation of geospatial data into the database, the QA procedures for well locational data, and the standardization of chemical names. 6 Additive ingredients are defined as ingredients reported for additives that have purposes other than base fluid or proppant. ------- Data Management and Quality Assessment Report March 2015 4. Assignment of Hydrocarbon Regions to Disclosures Operators reported the production type (oil or gas) on FracFocus 1.0 disclosures, but not the specific producing formation. To offer basic geologic context for the locations of the disclosures, the hydrocarbon regions underlying each disclosure's latitude and longitude coordinates were added to the QaWell table after conversion of the coordinates to the North American Datum 83 (NAD83) in Esri ArcGIS v. 10.1 geographic information system (CIS; Esri, 2012). National-scale spatial data describing the areal extent of hydrocarbon regions are limited—local and regional studies are more common. Five publicly available datasets with national coverage were chosen to be spatially joined to well locations. The National Oil and Gas Assessment province boundaries shapefile was obtained from the U.S. Geological Survey (USGS; USGS, 1995), and shapefiles for coalbed methane basins, tight gas basins, and shale gas plays and basins were obtained from the U.S. Energy Information Administration (EIA; US EIA, 2007, 2011a, b). These datasets were used for general reference purposes and with the understanding that the boundaries are approximate and that production may not be occurring from the co-located play. The following text boxes describe the content of these databases and provide links to metadata and file download locations. USGS Oil and Gas Provinces Field name Description Metadata Download USGSProvinces Thisdataset includes 71 very large oil and gas provinces delineated as part of the USGS's 1995 National Oil and Gas Assessment (USGS, 1995). Although this layer has coarse spatial resolution, it has the advantage of covering the entire lower 48 states plus Alaska, which means that (nearly) every disclosure in the project database will be located within a province. http://certmapper.cr. usgs.gov/geoportal/catalog/search/resource/details. page?uuid=%7B50B9 6CAA-20BD-4875-B3B2-BB3ElE6BlCD9%7D http://certmapper.cr.usgs.gov/data/noga95/natl/spatial/shape/pr natlg.zip EIA Shale Basins Field name Description Metadata Download ShaleBasin This dataset includes 32 major sedimentary basins that contain hydrocarbon-bearing shales and correspond to the translucent pink "Basins" in the EIA "Lower 48 States Shale Plays" map. http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/maps.htm http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/shalegasbasin.zip ------- Data Management and Quality Assessment Report March 2015 EIA Shale Plays Field name Description Metadata Download ShalePlay This dataset includes 45 shale plays that correspond to the translucent orange "Current Plays" and yellow "Prospective Plays" in the EIA "Lower 48 States Shale Plays" map. http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/maps.htm http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/shalegasplay.zip EIA Tight Gas Basins Field name Description Metadata Download File in ZIP archive: TightGas This dataset includes 13 sedimentary basins that contain tight gas formations and correspond the translucent pink "Basins" in the "Major Tight Gas Plays, Lower 48 States" map. http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/maps.htm to http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/tightgasbasinplay.zip TightGasBasins_EIA_June2010.shp EIA Coalbed Methane Basins Field name Description Metadata Download File in ZIP archive: CoalBed This dataset includes 98 sedimentary basins that contain coalbed methane and correspond to the translucent pink "Coal Basins, Regions & Fields" in the "Coalbed Methane Fields, Lower 48 States" map. http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/maps.htm http://www.eia.gov/pub/oil gas/natural gas/analysis publications/maps/cbm 4shps.zip CBMbasins_Reserv06_Prod06.shp ArcGIS 10.1 software was used for the spatial join process. The ArcGIS for Desktop Basic license includes the "Spatial Join" geoprocessing tool, which is routinely used to link the attributes of multiple sets of spatial data. In this case, the hydrocarbon regions were "join features," and the disclosure locations were the "target features." The disclosure locations were determined by the latitude and longitude coordinates in the project database (after QA and conversion to NAD83 ------- Data Management and Quality Assessment Report March 2015 datum, as described in Section 5), corresponding to the NAD83_Lon and NAD83_Lat fields in the OaWell table of the database. The "Join Operation" parameter was "JOIN_ONE_TO_ONE" and the "Match Option" parameter was set to "INTERSECT," such that a disclosure must spatially intersect the join feature in order to be assigned its value. The assignment of a hydrocarbon region to a disclosure record in the database is meant to give context to the disclosure location and is likely to be more reliable at the basin scale than at the play scale. Interpretations of the analysis results do not assume that the wells at the disclosure locations are producing from any of the co-located shale plays assigned by this spatial join. Another limitation in accurately assigning plays to the disclosure locations is that the EIA geospatial data do not include boundaries for tight sand plays or coalbed plays; only basin boundaries are available from EIA for these two types of unconventional plays. Therefore, in areas with stacked plays that include sands or coalbeds in addition to shales, it is not possible to determine whether the producing formation is a shale play or another formation based solely on the locational data and the spatial join. Also, comparable EIA geospatial data were not available for oil basins. For 4,644 disclosures (12% of 38,530 disclosures), the disclosure locations were within the surface boundaries of two EIA shale plays (i.e., plays with active production that are at different depths in the same general surface area, also known as "stacked plays"). Because operators do not report the play or formation that is being hydraulically fractured, there is ambiguity regarding the appropriate formation for the disclosure. Although operators provided TVDs, it is unknown if some of these values may include lateral lengths. Given the limitations of the TVD data, they were not used to interpret formation in regions with stacked plays in cases of shale play overlap for a location. Therefore, the value assigned to the ShalePlay field of the QaWell table is a combination of the individual shale play names, delimited by forward slashes (e.g., Avalon-Bone Spring/Barnett- Woodford. Arthur et al. (2014) and Carter et al. (2013) summarized data from FracFocus by plays by assuming that the geographic placement of disclosures approximated the geologic placement in popular production plays. Before using the same strategy to categorize results in the data analysis report, the accuracy of geospatial information in identifying plays associated with disclosures was assessed. The results of the spatial join were compared with analogous information from the commercial database Drillinglnfo (Drillinglnfo, 2011). Because the EIA geospatial data used for the spatial join included play-level boundaries for shales but not for tight sands or coalbeds (these were only delineated at the basin level), the comparison was limited to shales. Drillinglnfo is populated using state databases and includes information on producing formations. It includes API well numbers that correspond to 7,761 disclosures in the project database. Of the 7,761 disclosures, 7,153 are co-located with the EIA boundaries for shale plays. Among these 7,153 disclosures, 83% had EIA shale play designations generally consistent with the operator-identified formations in Drillinglnfo. Among the 17% of disclosures for which the EIA shale plays did not match the Drillinglnfo formations, the mismatches generally occurred where there are stacked plays that include shales in addition to tight sands or coalbeds, and the producing formation is a sandstone, limestone, or coal-bearing formation. ------- Data Management and Quality Assessment Report March 2015 At this time, the basin designations provide useful context for the project database, but shale play designations should be regarded with care in areas with stacked producing plays. Ultimately, the data were not summarized by play in the data analysis report to be consistent with the analysis of the data "as is." 5. Quality Assurance Process for Locational Data The well header table in each disclosure includes three sources of locational data: • State name and county name information, as stored in the StateFFQA and CountyFFQA fields, respectively, of the QaWell table. • State and county information encoded in the first five digits of the API Well Number, as stored in the APIFFQA field of the QaWell table. • Latitude and longitude coordinates in the well header, as stored in the LatitudeFFQA and LongitudeFFQA fields, respectively, of the QaWell table. The datum of the coordinates is stored in the ProjectionFFQA field of the QaWell table. Because the three locational sources were easily available and comparable, the location was determined to have met QA criteria if all three locational data fields agreed.7 To validate the location of each disclosure, the state and county entries for each of these three fields were compared. First, the leading five digits from APIFFQA were converted to state and county names usinglookup tables from the Society of Petrophysicists and Well Log Analysts (2010). Second, the states and counties that intersect the coordinates reported in the LatitudeFFQA and LongitudeFFQA fields were determined using ESRI ArcGIS 10.1 software. Due to the varying datums entered in the ProjectionFFQA field, four separate shapefiles were created: • Disclosures with a NAD83 projection were read into a point shapefile with the North American Datum of 1983 geographic coordinate system. • Disclosures with a WGS84 projection were read into a point shapefile with the World Geodetic System Datum of 1984 geographic coordinate system, and then transformed to NAD83 via the "NAD_1983_To_WGS_1984_l" datum transformation with the Project geoprocessing tool. • Disclosures with a NAD27 projection in the lower 48 United States were read into a point shapefile with the North American Datum of 1927 geographic coordinate system, and then transformed to NAD83 via the "NAD_1927_To_NAD_1983_NADCON" datum transformation with the Project geoprocessing tool. • Disclosures with a NAD27 projection with a StateFFQA listed as Alaska were read into a point shapefile with the North American Datum of 1927 geographic coordinate system, and 7 Well locations in Alaska were not subject to county-level locational QA criteria, because the five-digit API well numbers in Alaska are not organized by counties. The coordinates for all disclosures from Alaska fall within the boundaries of the North Slope borough. 10 ------- Data Management and Quality Assessment Report March 2015 then transformed to NAD83 via the "NAD_1927_To_NAD_1983_Alaska" datum transformation with the Project geoprocessing tool. Following datum transformations to NAD83, these four shapefiles were merged into a single shapefile using the Merge geoprocessing tool. The final latitude and longitude coordinates (after transformation to NAD83, if needed) were stored in the NAD83_Lat and NAD83_Lon fields, respectively, in the QaWell table. To join state and county names to each disclosure location, the Spatial Join geoprocessing tool was used with the 2010 TIGER/Line shapefile of counties from the US Census Bureau (USCB, 2011) with the "Join Operation" parameter set to "JOIN_ONE_TO_ONE" and the "Match Option" parameter set to "INTERSECT." The resulting attribute table was exported to Microsoft Excel (Microsoft Corporation, 2002). In Excel, the three sets of state and county locations were compared, resulting in six QA measures for the locational data. These comparisons were case-insensitive to avoid situations where, for example, the data values Mckee and McKee would not match. These comparisons also ignored spaces and hyphens to avoid situations where, for example, Me Kee and McKee would not match. For each of the six comparisons, a QA flag field was added to the data table with True or False Boolean values: • StateMatchAPI_FF indicates whether or not the API code for the state (APIState) matches the state reported in the well header table (StateFFQA). • StateMatchGIS_FF indicates whether or not the state that contains the CIS-mapped disclosure location (GISState) matches the state reported in the well header table (StateFFQA). • StateMatchAPI_GIS indicates whether or not the API code for the state (APIState) matches the state that contains the CIS-mapped disclosure location (GISState). • CountyMatchAPI_FF indicates whether or not the API code for the county (APICounty) matches the county reported in the well header table (CountyFFQA). • CountyMatchGIS_FF indicates whether or not the county that contains the CIS-mapped disclosure location (GISCounty) matches the county reported in the well header table (CountyFFQA). • CountyMatchAPI_GIS indicates whether or not the API code for the county (APICounty) matches the county that contains the CIS-mapped disclosure location (GISCounty). Based on these six fields, two additional flags were added: • AHStateOK is True if all three state comparison fields are True. • AHCountyOK is True if all six state and county comparison fields are True. Locational data were used in the data analysis report for analyses in which information was needed at the state or county level. The QA-related fields were used as appropriate to either exclude 11 ------- Data Management and Quality Assessment Report March 2015 disclosures that did not meet QA criteria from analyses or to categorize results with uncertain locational information. 6. Chemical Name Standardization Ingredient names and CASRNs are entered by operators in the ingredients table, and the names can include a wide range of variations for a given ingredient, including synonyms, misspellings, different punctuations and formatting, and different alpha-numeric spacing. To identify ingredients used in hydraulic fracturing fluids, entries of both ingredient names and CASRNs were verified and standardized. The CASRNs were determined valid for analyses after being verified with the Chemical Abstracts Service (2014); ingredient records with invalid CASRNs were excluded from certain analyses presented in the data analysis report. Note that this approach assumes that the CASRN entered into the project database is correct. Ingredient names for verified CASRNs were standardized using a list of unique chemical names paired with CASRNs developed by the EPA. This standardization was needed because of the above- noted range of presentations of ingredient names. Because the ingredient names were standardized, the names found in the data analysis report and the project database may differ from the names reported by operators in the original PDF disclosures. The EPA used standardized chemical names from Appendix A in the agency's Study of the Potential Impacts of Hydraulic Fracturing on Drinking Water Resources: Progress Report (2012) for the EPA- standardized chemical names used in the project database and in this report.8 Chemical name and structure quality control methods were used to standardize chemical names for CASRNs found in the project database, but not included in Appendix A of the Progress Report."3 The same methods were used in the development of Appendix A of the Progress Report and ensure correct chemical names and CASRNs. 8 Table A-l in the Progress Report. 9 In the majority of cases, valid CASRNs and the associated ingredient names in the project database were paired correctly for a given CASRN. If an ingredient name (whether specific or non-specific) did not match the CASRN reported by the operator, the CASRN was added to a chemical name standardization list and assigned a correct chemical name. The chemical standardization list consists of CASRNs paired with appropriate chemical names and was used to standardize chemical names in the project database based on the CASRNs reported by the operators. This process was undertaken because numerous synonyms and misspellings for a given chemical were present in the original data. Standardized, specific chemical names were identified using the EPA's Distributed Structure-Searchable Database Network (US EPA, 2013), the EPA's Substance Registry Services database (US EPA, 2014a), and the U.S. National Library of Medicine ChemID database (US NLM, 2014). Additional information on chemical name and structure quality control methods can be found at http://www.epa.gov/ncct/dsstox/ChemicalInfQAProcedures.html. 12 ------- Data Management and Quality Assessment Report March 2015 7. Data Field Descriptions The sections below provide a listing and descriptions of the data fields in the project database tables. 7.1. Data Fields in Main Tables The primary tables that contain the data from the disclosures are: • OriginalWell • QaWell • Originallngredient • Qalngredient The two "Original" tables contain the data as parsed from the original PDF disclosures. In the two "Qa" tables, data have undergone basic standardization, and a series of QA flag fields has been established to facilitate analyses. Fields with "QA" or "flag" in their names are in the "Qa" tables. 7.1.1. Well Header Field Descriptions This section lists the fields in the OriginalWell and QaWell tables, which contain information derived from the 38,530 disclosures with successfully parsed well headers. For convenience, these are grouped into relevant categories based on the well header source field. Well ID Wellld A unique identifier assigned to each disclosure that was parsed into the project database Fracture Job Date DateFF DateFFQA DateFFflag The verbatim fracture date from the parsed disclosure DateFF after minor editing to correct obvious typos, incorrect formatting, and remove invalid values OK OK, formatted Early Late Unclear 38,277 disclosures (99.34%) with DateFF unchanged 2 disclosures (0.0052%) with DateFF reformatted to fix an obvious typo 222 disclosures (0.58%) with DateFF before 1/1/2011 (the first day of the study period), which resulted in a blank for these disclosures in the DateFFQA field 28 disclosures (0.073%) with DateFF after 2/28/2013 (the last day of the study period), which resulted in a blank for these disclosures in the DateFFQA field 1 disclosure (0.0026%) with DateFF that could not be read, which resulted in a blank for these disclosures in the DateFFQA field 13 ------- Data Management and Quality Assessment Report March 2015 State StateFF StateFFQA StateFFflag The verbatim state name from the parsed disclosure StateFF after minor editing to correct obvious typos and differences in formatting OK OK, misspelled OK, postal to full 33,699 disclosures (87.46%) with StateFF unchanged 38 disclosures (0.099%) with StateFF corrected to fix an obvious typo 4,793 disclosures (12.44%) with StateFF corrected to substitute postal code (e.g., TX changed to Texas) County CountyFF CountyFFQA CountyFFflag The verbatim county name from the parsed disclosure CountyFF after minor editing to correct misspelled County names, remove extraneous "County" and "Parish" suffixes, and remove invalid values OK OK, misspelled OK, shortened Unclear 36,758 disclosures (95.40%) with CountyFF unchanged 563 disclosures (1.46%) with CountyFF corrected to fix an obvious typo 1,206 disclosures (3.13%) with CountyFF corrected to remove extraneous suffixes (e.g. County, Parish, Borough) 3 disclosures (0.0078%) with CountyFF that was omitted or otherwise erroneous, which resulted in a blank for these disclosures in the CountyFFQA field API Well Number APIFF APIFFQA APIFFflag The verbatim API well number from the parsed disclosure APIFF after minor editing to include leading zeroes and add hyphens OK OK, formatted Different than filename 29,168 disclosures (75.70%) with APIFF unchanged 9,352 disclosures (24.27%) with APIFF reformatted to include leading zeroes and add hyphens 10 disclosures (0.026%) with APIFF different than the API well number embedded in the PDF filename 14 ------- Data Management and Quality Assessment Report March 2015 Operator OperatorFF OperatorFFQA OperatorFFflag The verbatim well operator from the parsed disclosure OperatorFF after minor editing to aggregate synonymous and misspelled operator names OK OK, mapped 9,935 disclosures (25.79%) with OperatorFF unchanged 28,595 disclosures (74.21%) with OperatorFF changed to a synonym based on the OperatorStandardization table Well Name NameFF NameFFQA NameFFflag The verbatim well name from the parsed disclosure Matches NameFF because no values required editing OK 38,530 disclosures (100.0%) with NameFF unchanged Longitude LongitudeFF LongitudeFFQA LongitudeFFflag The verbatim longitude from the parsed disclosure LongitudeFF after minor editing to correct obvious typos and transpositions, and to remove invalid values OK OK, lat/lon swapped OK, nonnegative Unclear 38,394 disclosures (99.65%) with LongitudeFF unchanged 4 disclosures (0.010%) with LongitudeFF clearly transposed with latitude 129 disclosures (0.33%) with LongitudeFF erroneously non-negative but otherwise valid 3 disclosures (0.0078%) with LongitudeFF likely erroneous based on the resulting map location, which resulted in a blank for these disclosures in the LongitudeFFQA field 15 ------- Data Management and Quality Assessment Report March 2015 Latitude LatitudeFF LatitudeFFQA LatitudeFFflag The verbatim latitude from the parsed disclosure LatitudeFF after minor editing to correct obvious typos and transpositions, and to remove invalid values OK OK, lat/lon swapped OK, negative Unclear 38,518 disclosures (99.97%) with LatitudeFF unchanged 4 disclosures (0.010%) with LatitudeFF clearly transposed with longitude 5 disclosures (0.013%) with LatitudeFF erroneously negative but otherwise valid 3 disclosures (0.0078%) with LatitudeFF likely erroneous based on the resulting map location, which resulted in a blank for these disclosures in the LatitudeFFQA field Projection ProjectionFF ProjectionFFQA ProjectionFFflag The verbatim projection (technically a datum) from the parsed disclosure Matches ProjectionFF because no values required editing OK 38,530 disclosures (100.0%) with ProjectionFF unchanged Production Type (oil or gas) TypeFF TypeFFQA TypeFFflag The verbatim production type from the parsed disclosure Matches Type FF OK because no values required editing 38,530 disclosures (100.0%) with TypeFFQA unchanged 16 ------- Data Management and Quality Assessment Report March 2015 True Vertical Depth DepthFF DepthFFQA DepthFFflag The verbatim true vertical depth (in feet) from the parsed disclosure DepthFF after minor formatting to remove units, average ranges, and remove invalid values OK OK, formatted Range High Low Not given 37,721 disclosures (97.90%) with DepthFF unchanged 81 disclosures (0.21%) with DepthFF formatted to remove units and other extraneous characters 5 disclosures (0.013%) with DepthFF given as a range, which resulted in the DepthFFQA value being averaged from the minimum and maximum range values 14 disclosures (0.036%) with DepthFF greater than 25,000 feet, the upper threshold identified by the EPA for reasonable depths, which results in a blank for these disclosures in the DepthFFQA field 5 disclosures (0.013%) with DepthFF less than 500 feet, the lower threshold identified by the EPA for reasonable depths, which results in a blank for these disclosures in the DepthFFQA field 704 disclosures (1.83%) with DepthFF not reported, which results in a blank for these disclosures in the DepthFFQA field 17 ------- Data Management and Quality Assessment Report March 2015 Total Water Volume VolumeFF VolumeFFQA VolumeFFflag The verbatim total water volume (in gallons) from the parsed disclosure VolumeFF after minor formatting to remove units and remove invalid values OK OK, formatted OK, revised Empty, revised High Not given Unclear 38,108 disclosures (98.90%) with VolumeFF unchanged 27 disclosures (0.070%) with VolumeFF formatted to remove units and other extraneous characters 140 disclosures (0.36%) with VolumeFF revised due to altered header format 32 disclosures (0.083%) with VolumeFF removed due to altered header format 11 disclosures (0.029%) with VolumeFF greater than 50 million gallons (upper threshold set by the EPA), which results in a blank for these disclosures in the LatitudeFFQA field 133 disclosures (0.35%) with VolumeFF not reported, which results in a blank for these disclosures in the LatitudeFFQA field 79 disclosures (0.21%) with VolumeFF given but not valid numbers, which results in a blank for these disclosures in the LatitudeFFQA field Duplication APICount In table QAWell, the number of disclosures with this API well number. A total of 2,283 disclosures (5.93%) shared an API well number with at least one other disclosure. Authoritative In table QAWell, True if the disclosure is the authoritative disclosure among a set of duplicates with the same APIFFQA and DateFFQA, as determined by the folder date or file creation date. A total of 38,301 disclosures (99.41%) matched are authoritative. 18 ------- Data Management and Quality Assessment Report March 2015 Locational Data from API Well Number APIState APICounty In table QAWell, the name of the State associated with the first two digits of the API Well Number in the APIFFQA field. The State associations were downloaded from http://www.spwla.org/technical/us-state-codes. In table QAWell, the name of the County associated with the first five digits of the API Well Number in the APIFFQA field. The County associations were downloaded from http://www.spwla.org/xls/counties.xls. Locational Data from CIS Spatial Join of Longitude/Latitude Coordinates NAD83_Lon NAD83_Lat GISState GISCounty USGSProvince ShaleBasin ShalePlay TightGas CoalBed The LongitudeFFQA coordinate, after being converted to the NAD83 datum The LatitudeFFQA coordinate, after being converted to the NAD83 datum The name of the state in which the NAD83_Lat and NAD83_Lon are located. Coordinates did not intersect a state in 56 disclosures (0.15%), resulting in blank values for GISState field. The name of the county in which the NAD83_Lat and NAD83_Lon are located. Coordinates did not intersect a county in 56 disclosures (0.15%), resulting in blank values for GISCounty field. The name of the USGS Oil and Gas Province coincident with the disclosure's coordinates. Coordinates did not intersect a USGS province in 56 disclosures (0.15%), resulting in blank values for USGSProvince field. The name of the EIA Shale Basin coincident with the disclosure's coordinates. Coordinates did not intersect a shale basin in 1,120 disclosures (2.91%), resulting in blank values for ShaleBasin field. The name of the EIA Shale Play coincident with the disclosure's coordinates. Coordinates did not intersect a shale play in 14,894 disclosures (38.66%), resulting in blank values for ShalePlay field. The name of the EIA Tight Gas Basin coincident with the disclosure's coordinates. Coordinates did not intersect a tight gas basin in 4,170 disclosures (10.82%), resulting in blank values for TightGas field. The name of the EIA Coal Bed Methane Basin coincident with the disclosure's coordinates. Coordinates did not intersect a coalbed methane basin in 20,534 disclosures (53.29%), resulting in blank values for CoalBed field. 19 ------- Data Management and Quality Assessment Report March 2015 State Locational Matching StateMatchAPI_FF StateMatchGIS_FF StateMatchAPI_GIS True if APIState matches StateFFQA. The two field values matched for 38,476 disclosures (99.86%). True if GISState matches StateFFQA. The two field values matched for 38,390 disclosures (99.64%). True if APIState matches GISState. The two field values matched for 38,381 disclosures (99.61%). County Locational Matching CountyMatchAPI_FF CountyMatchGIS_FF CountyMatchAPI_GIS True if APICounty matches CountyFFQA. The two field values matched for 37,733 disclosures (97.93%). True if GISCounty matches CountyFFQA. The two field values matched for 36,894 disclosures (95.75%). True if APICounty matches GISCounty. The two field values matched for 37,372 disclosures (96.99%). Other locational fields AIIStateOK AIICountyOK True if all three StateMatch fields are true. The three field values matched for 38,359 disclosures (99.56%). True if all three StateMatch and all three CountyMatch fields are true. The three field values matched for 36,754 disclosures (95.39%). 7.1.2. Ingredient Field Descriptions This section lists the fields in the Originallngredient and Qalngredient tables, which provide information on additives and their ingredients, as well as base fluids and proppants. Ingredientld Wellld TradeName Supplier The unique identifier added to each ingredient the database The unique identifier added to each disclosure database record that was parsed into that was parsed into the The ingredient trade name. A number of trade name values are comma- joined lists of multiple trade names for the entire disclosure. Microsoft Access cannot store many of these long values in a text field, but converting to Memo would increase database size. The ingredient supplier. Supplier values (names) were standardized manually in QAIngredient. Table continued on next page 20 ------- Data Management and Quality Assessment Report March 2015 Purpose ChemicalName Cas EPAIngredientld AdditiveConcent ration FluidConcentration Comments ValidTradeName ValidPurpose Valid AdditiveConcent ration ValidFluidConcentration ValidCas The purpose assigned to a particular ingredient. In table QAIngredient, purpose entries were standardized manually to correct for misspellings, punctuation, hyphenation, and capitalization. The original value parsed from the disclosures, in the Originallngredient table; or the standardized chemical name, where available, in the Qalngredient table The CASRNs of the ingredient as parsed from the disclosures, in the Originallngredient table. In the Qalngredient table, CASRNs have been stripped of non-numeric characters and properly hyphenated, and CASRNs with invalid check digits have been removed. The identifier that links ingredient name standardization in the QAIngredient table with the IngredientNameStandardization table. Records for 796,692 ingredients were matched to an EPAIngredientName. The original "maximum ingredient concentration in additive (% by mass)" parsed from FracFocus disclosures, in the Originallngredient table. In the Qalngredient table, entries expressed as a single decimal value were kept intact, while non-numeric values or ranges for 353,157 values were changed to Null. The original "maximum ingredient concentration in hydraulic fracturing fluid (% by mass)," in the Originallngredient table. Entries expressed as a single decimal value were kept intact, while non-numeric values or ranges for 291,293 values were changed to Null. Comments entered by the operator on the FracFocus disclosure. No changes were made to values in this field. True if the trade name should be regarded as valid. This flag is set based on the TradeNameStandardization table. Values of TradeName appear to not be trade names for 252,361 ingredients; these have been flagged in the QAIngredients table as having an invalid trade name (value of False). True if the purpose should be regarded as valid. This flag is set based on the PurposeStandardization table. Values of Purpose appear not be purposes for 204,123 ingredient records; these have been flagged in the QAIngredients table as having an invalid purpose (value of Fo/se).are clearly not purposes. True if AdditiveConcentration is between 0 and 100. For 356,789 ingredients, this field has been flagged in the Qalngredients table as False (invalid value). True if FluidConcentration is between 0 and 100. For 293,614 ingredients, this field has been flagged in the Qalngredients table as False (invalid value). True if Cas matches a standardized ingredient in the IngredientNameStandardization table. For 433,753 ingredients, this field has been flagged in the Qalngredients table as False (invalid value). 21 ------- Data Management and Quality Assessment Report March 2015 7.2. Data Fields in Tables Associated with Standardizations Several tables store the corrections and standardizations used to develop the QAWell and QAIngredient tables. These standardizations have been conservatively developed to facilitate data analysis. 7.2.1. Chemical Name Standardization The following table lists the fields in the IngredientNameStandardization table. Ingredient names for verified CASRNs were standardized using a list of unique chemical names paired with CASRNs that was developed by the EPA (Section 6). EPAIngredientld EPAIngredientName Cas The primary key for the table, which can be used to join the Qalngredient and IngredientNameStandardization tables The chemical name for the ingredient as determined by the EPA The CASRN corresponding to an individual chemical. The EPA provided unique identifiers in the form of NOCAS_XXXXX (where XXXXX is a numerical identifier) for chemicals without CASRNs. 7.2.2. Operator Standardization Information This section lists the fields of the OperatorStandardization table. Original Standardized The original operator name, found in the Operator field of the Originallngredient table. The OperatorFF field in OriginalWell was joined to this table using this field during the standardization process. The standardized name to use in the Operator field of the Qalngredient table 22 ------- Data Management and Quality Assessment Report March 2015 7.2.3. Trade Name Standardization This section lists the fields of the TradeNameStandardization table, in which trade names were standardized to correct spelling and punctuation and evaluated to identify and flag entries that do not represent additives (e.g., numerical values, purposes, chemical names). Some fields were used in assigning a value to the ValidTradeName field in the Qalngredient table. Other fields provide additional categorization for reference. ID Multiple Entries in Trade Name Field Ingredient (General name) - not proppant Purpose Name Number that looks like possible concentration Possible CASRN Other Count A, B, C, D, E or F May or may not be Trade Name Commodity Proppant (generic or trade name) Suggested spelling or punctuation correction Trade Name as Listed in FracFocus A unique identifier for each row in this table. Checked if the trade name value appears to list multiple trade names. Some operators listed all additives used in one cell. This field is used to determine the value of the ValidTradeName field. Checked if the value appears to be an ingredient. This field is used to determine the value of the ValidTradeName field. Checked if the value appears to be an additive purpose. This field is used to determine the value of the ValidTradeName field. Checked if the value appears to be a chemical concentration (possibly the result of parsing errors). This field is used to determine the value of the ValidTradeName field. Checked if the value appears to be a CASRN. This field is used to determine the value of the ValidTradeName field. Checked if there appears to be another type of problem with the trade name value. This field is used to determine the value of the ValidTradeName field. 1 if any of the above 6 fields are checked, otherwise 0. Checked if it is not readily clear if the entry refers to something other than the trade name Checked if the value of the trade name is a commodity name (e.g., water) Checked if the value appears to indicate a proppant The standardized value of the TradeName field of the Qalngredient table The original value of the TradeName field of the Originallngredient table. The TradeName field in Originallngredient was joined to this table using this field during the standardization process. 7.2.4. Ingredient Purpose Standardization This section lists the fields of the PurposeStandardization table, in which purposes were evaluated to identify and flag entries that do not represent purposes (e.g., numerical values, chemical names, operator names). Some fields were used in assigning a value to the ValidPurpose field in the Qalngredient table. Other fields provide additional categorization for reference; the two fields 23 ------- Data Management and Quality Assessment Report March 2015 referring to proppants were used in querying for proppants and in excluding proppants from additive ingredient analyses. ID Multiple Entries in Purposes Field Ingredient (General Name)(excludes HCI) Commercial Product Name that doesn't include purpose and not IDd Purpose Can Be Inferred from Product Name or From Another Entry Item is Likely a Proppant Other Count B, C, D, E, F, or G Proppant - uses word Proppant or other Identifying Term Purpose corrected for caps, spacing, dashes, misspellings Purpose as Listed in FracFocus Related to Base Flu id Related to Alternative Carrier A unique identifier for each row in this table Checked if the additive purpose value appears to list multiple purposes. Some operators listed the purposes of all additives used in one cell. This field is used to determine the value of the ValidPurpose field. Checked if the value appears to be a chemical ingredient. This field is used to determine the value of the ValidPurpose field. Checked if the value appears to be a trade name of an additive. This field is used to determine the value of the ValidPurpose field. Checked if the purpose be inferred from an additive name or some other purpose entry for another ingredient record. This field is used to determine the value of the ValidPurpose field. Checked if the value appears to indicate a proppant, even though it does not use a common identifying term such as proppant or list one of the chemical names sand, silica, or quartz. This field is used to determine the value of the ValidPurpose field. Checked if there is another type of problem with the additive purpose value. This field is used to determine the value of the ValidPurpose field. 1 if any of the above 6 fields are checked, otherwise 0. Checked if the value appears to indicate a proppant, using the word proppant or listing one of the chemical names sand, silica, or quartz or other identifying term The standardized value of the Purpose field of the Qalngredient table The original value of the Purpose field of the Originallngredient table. The Purpose field in Originallngredient was joined to this table using this field during the standardization process. Checked if the additive purpose appears to be related to the base fluid Checked if the additive purpose appears to be related to a non-water base fluid. The relationship was determined by observation and used for analysis of non- water base fluids. 7.3. Data Fields in Other Tables Several additional tables have been added to the database with lists that were used to support the analyses described in the data analysis report 24 ------- Data Management and Quality Assessment Report March 2015 7.3.1. Proppant Identification This section contains information about the Proppants table, which lists solids (e.g., minerals, ceramics) associated with proppant-related purposes (as parsed from disclosures). Information in this table assisted with excluding the minerals used as proppants from analyses of additive ingredients. ChemicalName Cas OK to exclude The chemical name of the proppant. The ChemicalName field in Qalngredient was joined to this table using this field to identify proppants. The CASRN of the proppant Checked if the chemical can be excluded from the additive ingredient analyses 7.3.2. Resin Coating Identification This section contains information about the ResinCoating table, which lists ingredients parsed from disclosures associated with the additive purpose of resin coatings. This list assisted in capturing the ingredients used for resin coatings on proppants in analyses of additive ingredients. ChemicalName Cas The chemical name of the resin coating. The ChemicalName field joined to this table using this field to identify resin coatings. in Qalngredient was The CASRN of the resin coating 7.3.3. CBI Identification This section contains information about the CBISynonym table, which lists terms used to indicate that an operator has claimed CBI status for an ingredient in the ChemicalName and Cas fields of the Originallngredient table. This table was used for analyzing the numbers of ingredient records in the database that were listed by the operators as CBI. Term A term indicating CBI 7.3.4. Water Source Identification This section contains information about the WaterSourceTerm table, which lists terms in the TradeName and Comments fields of the Originallngredient table that indicate the source of water used for the base fluid (e.g., fresh, recycled). This table was used to query the database for information on water sources. Source A term indicating a water source 25 ------- Data Management and Quality Assessment Report March 2015 7.3.5. Purpose Categorization This section contains information about the PurposeCategorization table, which lists the categories of purposes as found in the Purpose field of the Oalngredient table. This table was used to group ingredients by purpose category. Category Purpose The category of the standardized purpose The standardized purpose 7.3.6. State Regulation Information This section contains information about the State Regulation table, which contains information about state reporting requirements. A single state may have multiple rows when regulations are amended. ID State Reporting Requirement Type EffectiveDate Effective Date within FF DB Timeframe? Notes A unique identifier for each row in this table The name of a state The recipient of required reporting, either the FracFocus registry (FracFocus), the state regulator (State), both FracFocus and the state (FracFocus AND State), or either FracFocus and the state (FracFocus OR State). The effective date of the state regulation. Either Y if the date is between 1/1/2011 and 2/28/2013 or N otherwise. Notes about the regulation, including relevant limitations. 26 ------- Data Management and Quality Assessment Report March 2015 7.3.7. County Information This section contains information about the Counties table, which contains information about counties. STATE COUNTY FIPS STATE_FIPS County Name StateName CaseStudy The state abbreviation The full name of a county (e.g., Clay County) The The The The county FIPS code state FIPS code short name of a county (e.g., Clay) name of a state Identifies whether the county is a focus county in the data analysis report 7.3.8. Water Synonyms This section contains information about the WaterSynonyms table, which contains a list of synonyms for an unknown water source. TradeName A synonym for an unknown water source 7.3.9. UnparsedPDFs This section contains information about the UnparsedPDFs table, which lists the 606 PDF files that could not be successfully parsed (Table 1). PDFName API_Final Data Storage Error State The PDF filename of the unparsed disclosure The API well number, as extracted from PDFName Identifies 14 disclosures that GWPC indicated should be excluded from the database because of a data storage error project The state in which the disclosure is located, based on the API well number 27 ------- Data Management and Quality Assessment Report March 2015 8. Summary The project database was developed from PDF disclosures given to the EPA by the GWPC and submitted to the FracFocus Chemical Disclosure Registry 1.0 before March 1, 2013. Data from the PDF files were converted to XML format, parsed, and incorporated into a Microsoft Access database. The data in the project database were then subject to QA procedures to ensure that the results from analyses of the project database reflect the data contained in the original PDF disclosures, while identifying obviously invalid or incorrect data to exclude from analyses. A conservative approach was used in all data handling; no records were deleted and the original data remain in the project database. To improve the results of analyses, data have been subject to minimal standardization of operator names, trade names, and purposes, as well as standardization of chemical names according to CASRNs. The standardized entries are included in the two "Qa" tables. During QA work on the project database, data limitations were encountered, and QA flag fields were developed to identify agreement among locational data and instances of problematic data. During data analysis, database queries and subsequent calculations were structured to compensate for these limitations. Results of analyses conducted on the project database are presented in the Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0 (US EPA, 2015). 28 ------- Data Management and Quality Assessment Report March 2015 References Adobe Systems Incorporated. 2011. Adobe AcrobatX Pro 10. Arthur, JD, Layne, MA, Hochheiser, HW, and Arthur, R. 2014. Spatial and Statistical Analysis of Hydraulic Fracturing Activities in US Shale Plays and the Effectiveness of the FracFocus Chemical Disclosure System. SPE Hydraulic Fracturing Technology Conference, The Woodlands, Texas, February 4-6. Society of Petroleum Engineers. Carter, KE, Hakala, JA, and Hammack, RW. 2013. Hydraulic Fracturing and Organic Compounds - Uses, Disposal and Challenges. SPE Eastern Regional Meeting, Pittsburgh, Pennsylvania, August 20- 22. Society of Petroleum Engineers. Chemical Abstracts Service. 2014. Check Digit Verification of CAS Registry Numbers. Available at http://www.cas.org/content/chemical-substances/checkdig. Accessed April 21, 2014. Drillinglnfo, Inc. 2011. DI Desktop December 2011 download. Esri,Inc. 2012. ArcGIS 10.1. Microsoft Corporation. 2013. Excel 2013. Microsoft Corporation. 2012. Access 2013. Python Software Foundation. 2012. Python 2.7. Richardson, L. 2013. Beautiful Soup 4. Society of Petrophysicists and Well Log Analysts. 2010. API Standards Information. Available at http://www.spwla.org/technical/api-codes. Accessed April 21, 2014. US Energy Information Administration (US EIA). 2007. Data for the Coalbed Methane Panels. Oil- and Gas-Related Maps, Geospatial Data, and Geospatial Software. Available at http://www.eia.gov/ pub/oil_gas/natural_gas/analysis_publications/maps/maps.htm. Accessed April 18, 2014. US EIA. 2011a. Data for the Tight Gas Plays Map. Oil- and Gas-Related Maps, Geospatial Data, and Geospatial Software. Available at http://www.eia.gov/pub/oil_gas/natural_gas/ analysis_publications/maps/maps.htm. Accessed April 18, 2014. US EIA. 2011b. Data for the US Shale Plays Map. Oil- and Gas-Related Maps, Geospatial Data, and Geospatial Software. Available at http://www.eia.gov/pub/oil_gas/natural_gas/ analysis_publications/maps/maps.htm. Accessed April 18, 2014. US Environmental Protection Agency (US EPA). 2012. Study of the Potential Impacts of Hydraulic Fracturing on Drinking Water Resources: Progress Report EPA601/R-12/011. US Environmental Protection Agency, Washington, DC. 278 pages. US EPA. 2013. Distributed Structure-Searchable Toxicity (DSSTox) Database Network. Available at http://www.epa.gov/ncct/dsstox/index.html. Accessed April 21, 2014. 29 ------- Data Management and Quality Assessment Report March 2015 US EPA. 2014. Substance Registry Services. Available at http://ofmpub.epa.gov/sorjnternet/ registry/substreg/home/overview/home.do. Accessed April 21, 2014. US EPA. 2015. Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0. EPA/600/R-1/003. US Environmental Protection Agency, Washington, DC. 168 pages. US National Library of Medicine (US NLM). 2014. ChemID Plus Advanced. Available at http://chem.sis.nlm.nih.gov/chemidplus. Accessed April 21, 2014. US Census Bureau (USCB). 2011. Toplogically Integrated Geographic Encoding and Referencing (TIGER)/Line Shapefiles. Available at ftp://ftp2.census.gov/geo/tiger/TIGER2010/COUNTY/2010. Accessed September 16, 2013. US Geological Survey (USGS). 1995. Province Boundaries shapefile. National Oil and Gas Assessment Available at https://catalog.data.gov/dataset/1995-national-oil-and-gas-assessment- province-boundaries. Accessed April 18, 2014. 30 ------- Data Management and Quality Assessment Report March 2015 [This page intentionally left blank.] 31 ------- &EPA United States Environmental Protection Agency ------- |