United States Environmental Protection Agency

Development of a Water Solubility Dataset to Establish Best Practices for Curating New Datasets for QSAR Modeling

Charles Lowe and Antony Williams
U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure, Research Triangle Park, NC

ChemCuration 2019
December 3rd, 2019
ORCID: 0000-0001-9151-6157
Charles Lowe | lowe.charles@epa.gov | 919-541-5618

Problem Definition and Goals

Problem: There are numerous peer-reviewed publications and public websites that contain experimental data that could be used to improve existing QSAR/QSPR models. Commonly, these data are not available in an ideal form: they are often limited to PDF supplementary information files for publications (with names or CASRNs and no electronic structure format). However, when aggregation of these data has been attempted, curation has been necessary.

Goals:
• Provide a de facto dataset for water solubility data that can be used to build multiple models and eventually a consensus model.
• Identify specific sets of chemicals that can improve existing models.
• Curate these data to ensure chemical identifiers represent the same chemical structure, physicochemical property data has consistent units, etc.
• Make these data available as downloadable data for use in QSAR/QSPR models and reuse in other databases.

The project was started using aqueous solubility data available from the OCHEM database (https://ochem.eu/).

Abstract

The U.S. Environmental Protection Agency's CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) hosts a plethora of environmentally relevant chemical information, including physical property data suitable for QSAR/QSPR modeling. The development of these physical property datasets has generally involved the curation of publicly available experimental data. The ease of accessing this data, along with the overall quality of the dataset (i.e.
machine-readable formatting, inclusion of experimental conditions, etc.) is highly variable. The purpose of this work is to identify the challenges associated with acquiring physical property datasets, with a focus on obtaining water solubility values for organic compounds. Common issues discovered in this data will be presented, along with solutions that can be easily implemented in a high-throughput manner. The end result will be a standard workflow a researcher can follow when curating physical property datasets.

This abstract does not necessarily represent the views or policies of the U.S. Environmental Protection Agency.

Issues Discovered During Aggregation and Curation

[Figure: two spreadsheet excerpts with columns for structure strings, "CASRN recorded", "Chemical Name", and solubility values in log(mol/L)]

Left: Issues such as SMARTS strings recorded in place of SMILES, and chemical names that contain other identifiers such as CASRNs, are numerous.

Right: A well-known issue where Excel converts CASRNs to dates can be solved by using the function =TEXT(CASRN value, "yyyy-mm-d") in an adjacent cell. It is also common to see chemical names truncated with significant information loss.
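The Excel repair described above can also be checked once the data leave the spreadsheet. The following is a minimal Python sketch, not the authors' workflow. It validates a CASRN with the standard CAS check-digit rule, rebuilds a CASRN from a value that was parsed as a date (mirroring the "yyyy-mm-d" format string), and strips leading zeros from the first segment. All function names are hypothetical and chosen for illustration only.

```python
import re
from datetime import date

# A CASRN has 2-7 digits, a hyphen, 2 digits, a hyphen, and 1 check digit.
CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string has CASRN format and a correct check digit."""
    match = CASRN_PATTERN.match(casrn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Check-digit rule: weight the digits 1, 2, 3, ... starting from the
    # rightmost digit of the body, sum the products, and take the sum mod 10.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def casrn_from_date(value: date) -> str:
    """Rebuild the CASRN-like string from a value that was parsed as a date.

    Equivalent to the "yyyy-mm-d" format: two-digit month, unpadded day.
    """
    return f"{value.year}-{value.month:02d}-{value.day}"


def strip_leading_zeros(casrn: str) -> str:
    """Remove leading zeros from the first CASRN segment only."""
    first, sep, rest = casrn.strip().partition("-")
    return f"{first.lstrip('0')}{sep}{rest}"


if __name__ == "__main__":
    recovered = casrn_from_date(date(1918, 2, 1))
    print(recovered, is_valid_casrn(recovered))
    print(strip_leading_zeros("000080-05-7"), is_valid_casrn("80-05-7"))
```

Running the check digit test after each repair is a cheap way to flag rows where the recovered string is still not a plausible CASRN. A passing checksum does not prove that the number is assigned to the intended chemical, so the result still needs to be compared against the recorded name or structure.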
[Figure panel labels: "CAS numbers converted to dates"; "Truncated name"]

[Figure: spreadsheet excerpts and structure drawings labeled "Inconsistent representation of stereochemistry", with values in log(mol/L)]

Above, Left: For a given value, stereochemical information is sometimes present in some chemical identifiers and absent from others. This can cause problems for specific endpoints that depend on differences in stereochemistry. QSAR/QSPR models may not specifically take stereochemistry into account, but registration of experimental data would.

[Figure: spreadsheet excerpts labeled "Leading zeros on CAS numbers", "Invalid property data incorrectly converted to valid format", and "Invalid InChIKeys"]

Above, Right: Other problems include superfluous text, such as leading zeros on CAS numbers, and incorrect data types in certain data columns (e.g. invalid InChIKeys). Property data that has undergone a unit conversion can have problems such as too many significant digits and incorrect values due to improper entry of the original dataset.

Future Work
[Figure: CompTox Chemicals Dashboard page for Bisphenol A (CASRN 80-05-7, DTXSID7020182), searched by DSSTox Substance Id, showing a Water Solubility Range of 5.25e-4 to 1.51e-3 with source citations: Tetko, Igor V., et al. "Estimation of aqueous solubility of chemical compounds using E-state indices." J. Chem. Inf. Comput. Sci. 41.6 (2001): 1488-1493; Kovdienko, et al. Molecular Informatics 29.5 (2010): 394-406.]

Left: Curated physicochemical property data in the CompTox Chemicals Dashboard appears in the Experimental Section under the Properties Tab for each property endpoint associated with that chemical.

• We continue to harvest and curate experimental data for display of multiple physicochemical and fate and transport properties on the Dashboard.
• Retrain existing QSAR models using this newly harvested property data and note any improvements this approach offers.
• New data harvested to date will be displayed in the March 2020 release of the Dashboard.

References

1. Sushko, Iurii, et al. "Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information." Journal of Computer-Aided Molecular Design 25.6 (2011): 533-554.

www.epa.gov/research
Innovative Research for a Sustainable Future