United States
Environmental
Protection Agency
Development of a Water Solubility Dataset to
Establish Best Practices for Curating New Datasets
for QSAR Modeling
Charles Lowe and Antony Williams
U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure,
Research Triangle Park, NC
ChemCuration 2019
December 3rd, 2019
ORCID: 0000-0001-9151-6157
Charles Lowe | lowe.charles@epa.gov | 919-541-5618
Problem Definition and Goals
Problem: Numerous peer-reviewed publications and public
websites contain experimental data that could be used to improve
existing QSAR/QSPR models. These data are commonly not available in an
ideal form: they are often limited to PDF supplementary information files for
publications (with names or CASRNs and no electronic structure format). When
aggregation of these data has been attempted, curation has been necessary.
Goals: Provide a de facto dataset for water solubility data that can be used
to build multiple models and eventually a consensus model. Identify specific
sets of chemicals that can improve existing models. Curate these data to
ensure chemical identifiers represent the same chemical structure,
physicochemical property data has consistent units, etc. Make these data
available as downloadable data for use in QSAR/QSPR models and reuse in
other databases. The project was started using aqueous solubility data
available from the OCHEM database (https://ochem.eu/).
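The goal of consistent units can be illustrated with a minimal sketch that maps reported solubility values onto log10(mol/L), the unit that appears in the harvested data. The function name `to_log_molar`, the list of accepted unit labels, and the requirement that a molecular weight be supplied for mass-based units are assumptions made for this example, not part of the poster's workflow.

```python
import math

# Mass-based unit labels (hypothetical selection) and their factor to g/L.
MASS_UNITS = {"g/l": 1.0, "mg/l": 1e-3, "ug/l": 1e-6}


def to_log_molar(value, unit, mol_weight=None):
    """Convert a solubility value to log10(mol/L).

    mol_weight is in g/mol and is only needed for mass-based units.
    """
    # Normalize the label: drop whitespace and ignore case, so that
    # "log(mol/l)" and "Log(mol/L)" are treated as the same unit.
    key = "".join(unit.split()).lower()
    if key == "log(mol/l)":
        return value
    if key == "mol/l":
        return math.log10(value)
    if key in MASS_UNITS:
        if mol_weight is None:
            raise ValueError("mol_weight is required for mass-based units")
        grams_per_litre = value * MASS_UNITS[key]
        return math.log10(grams_per_litre / mol_weight)
    raise ValueError(f"unrecognized unit: {unit!r}")
```

Rejecting unrecognized labels with an error, rather than guessing, keeps malformed unit strings visible for manual review.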
Abstract
The U.S. Environmental Protection Agency's CompTox Chemicals Dashboard
(https://comptox.epa.gov/dashboard) hosts a plethora of environmentally-
relevant chemical information, including physical property data suitable for
QSAR/QSPR modeling. The development of these physical property
datasets has generally involved the curation of publicly-available experimental
data. The ease of accessing these data, along with the overall quality of each
dataset (e.g. machine-readable formatting, inclusion of experimental
conditions), is highly variable. The purpose of this work is to identify the
challenges associated with acquiring physical property datasets, with a focus
on obtaining water solubility values for organic compounds. Common issues
discovered in these data will be presented, along with solutions that can be
easily implemented in a high-throughput manner. The end result will be a
standard workflow a researcher can follow when curating physical property
datasets. This abstract does not necessarily represent the views or policies of
the U.S. Environmental Protection Agency.
Issues Discovered During Aggregation and Curation
[Figure: spreadsheet excerpts showing malformed SMILES strings, log(mol/L) unit labels, and "CASRN recorded" and "Chemical Name" columns]
Left: Issues such as SMILES being
replaced by SMARTS and
chemical names containing other
identifiers such as CASRNs are numerous.
Right: A well-known issue in which Excel
converts CASRNs to dates can be
solved by using the function
=TEXT(CASRN value, "yyyy-mm-d") in an
adjacent cell. It is also common to see
chemical names truncated with
significant information loss.
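Outside of Excel, the same recovery can be sketched in Python. The example below rebuilds a CASRN-shaped string from a date cell, validates it with the CASRN check digit (the digits before the check digit are weighted 1, 2, 3, ... from right to left and summed modulo 10), and looks for checksum-valid CASRNs embedded in chemical names. All function names are hypothetical, and the value 2008-05-1 used in the usage note is only an illustrative string that happens to pass the checksum.

```python
import datetime
import re

CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
CASRN_IN_TEXT = re.compile(r"\b\d{2,7}-\d{2}-\d\b")


def is_valid_casrn(casrn):
    """Check the CASRN shape and its check digit."""
    match = CASRN_PATTERN.match(casrn)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def casrn_from_date(date_value):
    """Python analogue of the Excel format "yyyy-mm-d"."""
    return f"{date_value.year}-{date_value.month:02d}-{date_value.day}"


def recover_casrn(cell):
    """Return a checksum-valid CASRN for a cell, or None if none is found."""
    if isinstance(cell, datetime.date):
        candidate = casrn_from_date(cell)
    else:
        candidate = str(cell).strip()
    return candidate if is_valid_casrn(candidate) else None


def find_casrns_in_name(name):
    """List checksum-valid CASRNs that appear inside a chemical name."""
    return [hit for hit in CASRN_IN_TEXT.findall(name) if is_valid_casrn(hit)]
```

For example, `recover_casrn(datetime.date(2008, 5, 1))` yields `"2008-05-1"`, while a date whose reconstructed string fails the checksum yields `None` and can be flagged for manual review.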
[Figure: inconsistent representation of stereochemistry among identifiers recorded for the same log(mol/L) value]
Above, Left: For a given value, stereochemical information is sometimes present
in some chemical identifiers and absent from others. This can
cause problems for specific endpoints that depend on differences in
stereochemistry. QSAR/QSPR models may not specifically take
stereochemistry into account, but registration of experimental data would.
[Figure: spreadsheet excerpts annotated "Leading zeros on CAS numbers", "Invalid property data incorrectly converted to valid format", and "Invalid InChIKeys"]
Above, Right: Other problems include superfluous text, such
as leading zeros on CAS numbers, and incorrect data types in certain data
columns (e.g. invalid InChIKeys). Property data that have undergone a unit
conversion can have problems such as too many significant digits and
incorrect values due to improper entry of the original dataset.
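These checks lend themselves to small, high-throughput helpers. The sketch below shows one possible form: a format check for standard InChIKeys (14, 10, and 1 uppercase letters separated by hyphens), removal of leading zeros from the first CASRN segment, and rounding of converted values to a fixed number of significant figures. The function names and the default of three significant figures are assumptions chosen for illustration.

```python
import math
import re

INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_wellformed_inchikey(text):
    """True if the text has the shape of an InChIKey (format check only)."""
    return bool(INCHIKEY_PATTERN.match(text.strip()))


def strip_casrn_leading_zeros(casrn):
    """Remove zero padding from the first segment, e.g. 000080-05-7."""
    first, sep, rest = casrn.strip().partition("-")
    return first.lstrip("0") + sep + rest


def round_sig(value, digits=3):
    """Round a number to the given count of significant figures."""
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)
```

A format check does not confirm that an InChIKey matches the recorded structure; it only catches entries such as "NA" or a SMILES string placed in the wrong column.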
Future Work
[Figure: CompTox Chemicals Dashboard record for Bisphenol A (80-05-7 | DTXSID7020182) showing experimental water solubility values in the range 5.26e-4 to 1.51e-3. Sources shown: Tetko, Igor V., et al. "Estimation of aqueous solubility of chemical compounds using E-state indices." J. Chem. Inf. Comput. Sci. 41.6 (2001): 1488-1493; Kovdienko, et al. Molecular Informatics 29.5 (2010): 394-406.]
Left: Curated physicochemical property data in the
CompTox Chemicals Dashboard appears in the
Experimental Section under the Properties Tab for each
property endpoint associated with that chemical.
• Continue to harvest and curate experimental data
for display of multiple physicochemical and fate and
transport properties on the Dashboard.
• Retrain existing QSAR models using these newly
harvested property data and note any improvements
this approach offers.
• Display the new data harvested to date in the
March 2020 release of the Dashboard.
References
1. Sushko, Iurii, et al. "Online chemical modeling environment (OCHEM):
web platform for data storage, model development and publishing of
chemical information." Journal of Computer-Aided Molecular Design
25.6 (2011): 533-554.
www.epa.gov/research
Innovative Research for a Sustainable Future
