United States Environmental Protection Agency
Development of a Water Solubility Dataset to Establish Best Practices for Curating New Datasets for QSAR Modeling
August 17-20, 2020
Charles Lowe1, Chris Grulke1 and Antony Williams1
1U.S. Environmental Protection Agency, Office of Research and Development, Center for Computational Toxicology and Exposure, Research Triangle Park, NC
ORCID: 0000-0001-9151-6157
Charles Lowe | lowe.charles@epa.gov | 919-541-5618
Problem Definition and Goals
Problem: Numerous peer-reviewed publications and public websites
contain experimental data that could be used to improve existing
QSAR/QSPR models. These data are commonly not available in an
ideal form: they are often limited to PDF supplementary information
files for publications (with names or CASRNs and no electronic
structure format). As a result, whenever aggregation of these data
has been attempted, curation has been necessary.
Goals: Provide a de facto dataset for water solubility data that can
be used to build multiple models and eventually a consensus model.
Identify specific sets of chemicals that can improve existing models.
Curate these data to ensure chemical identifiers represent the same
chemical structure, physicochemical property data have consistent
units, etc. Make these data available as downloadable data for use
in QSAR/QSPR models and reuse in other databases.
Simplified Workflow for Dataset Assembly
Data Sources
Curation of chemical identifiers
Water solubility measurement
Experimental conditions (temperature, pH)
Exclusion criteria:
  temperature outside 20 - 30 °C
  pH outside 6.5 - 7.5
QSAR-ready SMILES
Unit standardization
Dataset suitable for modeling
Figure 1: This diagram shows the simplified workflow used in the
assembly of the water solubility dataset. Note that chemical structure
is represented using QSAR-ready SMILES - a SMILES
representation of the desalted, de-isotoped, stereo-neutral forms of
chemical structures associated with particular chemical substances.
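The exclusion and unit-standardization steps in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the poster: the `Measurement` record, the unit table, the choice of mol/L as the common unit, and the decision to keep records whose temperature or pH was not reported are all hypothetical. The example values are illustrative and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Acceptance windows from the workflow: 20-30 C and pH 6.5-7.5.
TEMP_RANGE_C = (20.0, 30.0)
PH_RANGE = (6.5, 7.5)

# Hypothetical unit table: factor that converts each mass unit to g/L.
TO_GRAMS_PER_LITER = {"g/L": 1.0, "mg/L": 1e-3, "ug/L": 1e-6}


@dataclass
class Measurement:
    smiles: str                      # QSAR-ready SMILES
    value: float                     # reported solubility
    unit: str                        # "g/L", "mg/L", "ug/L" or "mol/L"
    mol_weight: float                # g/mol
    temp_c: Optional[float] = None   # None when not reported
    ph: Optional[float] = None       # None when not reported


def within(value, bounds):
    """True when value is unreported (assumption: keep) or inside the bounds."""
    return value is None or bounds[0] <= value <= bounds[1]


def keep(m):
    """Apply the temperature and pH exclusion criteria to one record."""
    return within(m.temp_c, TEMP_RANGE_C) and within(m.ph, PH_RANGE)


def to_mol_per_liter(m):
    """Standardize a measurement to mol/L (assumed common unit)."""
    if m.unit == "mol/L":
        return m.value
    return m.value * TO_GRAMS_PER_LITER[m.unit] / m.mol_weight


# Illustrative records only (benzene, molecular weight 78.11 g/mol).
records = [
    Measurement("c1ccccc1", 1780.0, "mg/L", 78.11, temp_c=25.0, ph=7.0),
    Measurement("c1ccccc1", 2.1, "g/L", 78.11, temp_c=40.0),   # too warm
    Measurement("c1ccccc1", 0.023, "mol/L", 78.11),            # conditions unreported
]
kept = [m for m in records if keep(m)]
standardized = [to_mol_per_liter(m) for m in kept]
```

Whether records with unreported conditions should be kept or dropped is a curation decision the poster does not state; flipping the `value is None` branch in `within` gives the stricter behavior.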
Article Identifier | Original No. of Chemicals | No. of QSAR-ready SMILES
https://doi.org/10.1021/ci700307p | 287 | 286
https://doi.org/10.1002/minf.201000001 | 2810 | 2596
https://doi.org/10.1016/S0045-6535(02)00118-2 | 1719 | 1530
https://doi.org/10.1080/10807039.2015.1133242 | 1190 | 1155
https://doi.org/10.1186/s13321-017-0250-y | 100 | 99
https://doi.org/10.6084/m9.figshare.1514952.v1 | 3315 | 1836
https://www.ebi.ac.uk/chembl/ | 326 | 323
Table 1: Current articles and databases assembled using the
workflow shown in Figure 1. The number of QSAR-ready SMILES
denotes the unique chemical structures available for modeling after
curation and solubility measurement cleaning.
Simplified Modeling Workflow
Dataset suitable for modeling
Generate PaDEL descriptors
Remove insoluble compounds
Take log10 of solubility values
Remove descriptors with missing values and those that are highly correlated
For n > 1, average solubility values; if SD > 0.25, exclude the chemical
Split 75% of data into training set and 25% into test set
Maintain endpoint distribution
Random forest
Weighted K-nearest neighbors
Gradient boosting
Figure 2 (above): This diagram shows the simplified workflow used in
the modeling of the water solubility dataset.
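The replicate-handling and splitting steps of the workflow can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that the SD threshold of 0.25 is applied to the log10 values using the sample standard deviation, that "insoluble" means a non-positive reported value, and that the endpoint distribution is maintained by sampling 25% from each of several equal-sized bins of the sorted endpoint. The function names and the number of bins are hypothetical.

```python
import math
import random
import statistics
from collections import defaultdict

SD_CUTOFF = 0.25  # threshold from the workflow; assumed to be in log10 units


def aggregate(records, sd_cutoff=SD_CUTOFF):
    """Collapse replicate measurements to one log10 value per structure.

    records: iterable of (smiles, solubility) pairs in a common unit.
    Returns {smiles: mean log10 solubility}.
    """
    by_structure = defaultdict(list)
    for smiles, solubility in records:
        if solubility <= 0:  # assumption: treated as insoluble and removed
            continue
        by_structure[smiles].append(math.log10(solubility))

    result = {}
    for smiles, logs in by_structure.items():
        # For n > 1, drop the chemical when replicates disagree too much.
        if len(logs) > 1 and statistics.stdev(logs) > sd_cutoff:
            continue
        result[smiles] = statistics.fmean(logs)
    return result


def stratified_split(values, test_fraction=0.25, n_bins=10, seed=0):
    """Split keys of `values` into train/test, sampling each endpoint bin."""
    rng = random.Random(seed)
    ordered = sorted(values, key=values.get)
    bin_size = max(1, math.ceil(len(ordered) / n_bins))
    train, test = [], []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        rng.shuffle(chunk)
        n_test = round(len(chunk) * test_fraction)
        test.extend(chunk[:n_test])
        train.extend(chunk[n_test:])
    return train, test
```

A brief usage example with made-up values: `aggregate([("A", 10.0), ("A", 1000.0), ("B", 100.0), ("B", 100.0)])` keeps only `"B"`, because the two log10 values for `"A"` (1 and 3) differ by far more than the cutoff.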
Table 2 (below): Modeling approaches with performance values for
Model Name | Training Dataset (5-fold CV), Size 3153: RMSE | R2 | Test Dataset, Size 1049: RMSE | R2 | External Test Dataset, Size 4224: RMSE | R2
Weighted K-Nearest Neighbors | 0.95 | 0.82 | 0.98 | 0.81 | 0.76 | 0.89
Gradient Boosting | 0.84 | 0.86 | 0.90 | 0.84 | 0.68 | 0.91
Random Forest | 0.89 | 0.85 | 0.92 | 0.84 | 0.69 | 0.91
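For readers who want to compute the same kind of metrics on their own predictions, the two statistics can be sketched in plain Python. The poster does not state which definition of R2 was used; this sketch assumes the coefficient of determination (1 minus the ratio of residual to total sum of squares), and the function names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up log10 solubility values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
example_rmse = rmse(observed, predicted)
example_r2 = r_squared(observed, predicted)
```

Because the models are fit to log10 solubility, an RMSE computed this way is in log10 units of the solubility measurement.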
Dataset and Model Performance Metrics
Figure 3: Count of the instances of chemicals in different data
sources (above). Distribution of water solubility values in training
and test sets (right).
Atrazine | 1912-24-9 | DTXSID9020112
Figure 4: Selection of PaDEL descriptors via recursive feature
elimination for K-NN model (top left). Correlation of predicted and
experimental values in training set for models (remaining plots).
Figure 5 (left): Experimental water solubility
data available on the CompTox Chemicals
Dashboard.2
Supplemental Information
External test set is a curated version of the PhysProp dataset
developed for EPI Suite,
https://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface.
Navigate to
https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID9020112#properties
and select "Water Solubility" from the dropdown menu.
www.epa.gov/research
Innovative Research for a Sustainable Future
The views expressed are those of the authors and do not necessarily represent the policies of the US EPA
