EPA
United States
Environmental Protection
Agency
Predicting Chromatography-tandem Mass Spectrometry
Amenability to Improve Non-targeted Analysis
Charles Lowe1, Kristin Isaacs1, Chris Grulke1, Jon Sobus1, Elin Ulrich1, Alex Chao1,2, and
Antony J. Williams1
1. Center for Computational Toxicology and Exposure, U.S. Environmental Protection Agency, Research Triangle Park, NC
2. Oak Ridge Institute of Science and Education (ORISE) Research Participant, Research Triangle Park, NC
Office of Research and Development
Center for Computational Toxicology and Exposure

-------
Disclaimer: The views expressed in this presentation are
those of the authors and do not necessarily reflect the
views or policies of the U.S. Environmental Protection
Agency. This presentation has not been reviewed for
policy and is not for distribution.

-------
What are we trying to model?
[Figure: example mass spectrum]
https://images.app.goo.gl/ftRmhwxEtZs95uKv7

-------
What are we trying to model?
For more details on NTA at EPA, please see:
EPA's research initiatives on non-targeted
analyses of environmental chemicals
PRESENTER: Jon Sobus
PAPER ID: 3428870

[Figure: example mass spectrum; x-axis: m/z, 20 to 200]
https://images.app.goo.gl/ftRmhwxEtZs95uKv7

-------
The more data, the better (most of the time...)
MoNA - MassBank of North America: Downloads
A set of commonly referenced predefined queries. Clicking the name of the query will display the associated spectra in the query browser. Each query is also available to download
in either the MoNA internal JSON format or as NIST MS Search compatible MSP files.
-All Spectra (659,728 spectra)
-In-Silico Spectra (490,087 spectra)
-Experimental Spectra (169,641 spectra)
-GC-MS Spectra (18,883 spectra)
-LC-MS Spectra (133,301 spectra)
-LC-MS/MS Spectra (125,833 spectra)
-LC-MS/MS Positive Mode (86,576 spectra)
-LC-MS/MS Negative Mode (38,475 spectra)
772 compounds in derivatized GCMS
7,199 compounds in non-derivatized GCMS
3,549 compounds in ESI+ LCMS
2,630 compounds in ESI- LCMS

-------
Caffeine
Record 1: Originally submitted to the MassBank High Quality Mass Spectral Database
instrument type: QqQ
instrument: Micromass Quattromicro
collision energy: 15eV
ionization: ESI
ionization mode: positive
ms level: MS2
precursor m/z: 194.9000
precursor type: [M-H]+
accession: PM018511
publication: Alonso-Salces KM, Guillou...
Record 2: Originally submitted to the RIKEN MSn Spectral Database for Phytochemicals
instrument: Pegasus EI TOF-MS system
instrument type: GC-EI-TOF
ms level: MS1
retention index: 1880.2430
retention time: 724.344 sec
ionization mode: positive
accession: OUF00133
date: 2016.01.19 (Created 2010....
author: Tsujimoto Y, Tsugawa H, ...
license: CC BY-SA

-------
Describing structures for modeling

Software News and Update
PaDEL-Descriptor: An Open Source Software to
Calculate Molecular Descriptors and Fingerprints
CHUN WEI YAP
Department of Pharmacy, Pharmaceutical Data Exploration Laboratory,
National University of Singapore. Singapore
Received 17 May 2010; Revised 22 August 2010; Accepted 12 October 2010
DOI 10.1002/jcc.21707
Published online 17 December 2010 in Wiley Online Library (wileyonlinelibrary.com).
1,444 1D & 2D molecular descriptors from QSAR-ready SMILES. Examples include:
-Electrotopological state
-McGowan volume (van der Waals volume)
-molecular linear free energy relationships
-Atom, bond, & ring counts
-LogP predictions, etc..

-------
Reduction of descriptor space
Dimension reduction will improve our models and make calculations quicker
1.	Remove any constant descriptors (variance(x) = 0)
2.	Remove nearly constant descriptors (SD < 0.25)
-	0.25 gives a good balance between reduction and retention
3.	Calculate pair-wise correlations between remaining descriptors
-	Eliminate based on a cutoff = 0.96 correlation
1,444 descriptors -> 385 descriptors
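The three reduction steps above can be sketched in a few lines. This is only an illustration in standard-library Python, not the R code used in the study. The function names, the dict-of-columns input, the use of the sample standard deviation, the absolute value of the correlation, and the rule of keeping the first descriptor of each correlated pair are all assumptions; the slide gives only the 0.25 and 0.96 cutoffs.

```python
from statistics import mean, stdev


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def reduce_descriptors(table, sd_cutoff=0.25, corr_cutoff=0.96):
    """Return the names of the descriptors that survive the reduction.

    table maps a descriptor name to its list of values (one per chemical).
    """
    # Steps 1 and 2: constant columns have SD = 0, so one SD filter
    # removes both constant and nearly constant descriptors.
    varying = {name: col for name, col in table.items()
               if stdev(col) >= sd_cutoff}

    # Step 3: greedy pass in input order (an assumption); a descriptor is
    # dropped if it correlates too strongly with one that was already kept.
    kept = []
    for name, col in varying.items():
        if all(abs(pearson(col, varying[k])) <= corr_cutoff for k in kept):
            kept.append(name)
    return kept
```

Which member of a highly correlated pair to discard is a design choice the slide does not state; the greedy first-come rule here is just the simplest option.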

-------
Datasets suitable for modeling
Models need both training and test data
•	75% of data for training, 25% for testing
-Data stratified to maintain proportions in outcome variable
-Different for each model
-InChIKey skeleton as identifier
•	External validation datasets
-EPA's NTA Collaborative Trial (ENTACT) data (explicitly removed from train/test sets)
library(readxl)
library(caret)
library(randomForest)
library(funModeling)
library(tidyverse)
library(GA)
library(AdaSampling)
library(wsrf)
library(rsample)
library(dbscan)
R libraries used in study
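The 75/25 stratified split described on this slide could look roughly like the sketch below. It is standard-library Python for illustration only (the study itself used the R libraries listed above). It assumes that the "InChIKey skeleton" is the first 14-character block of an InChIKey, that duplicate skeletons are collapsed to one record, and that records arrive as hypothetical `(inchikey, outcome)` pairs; none of these details are spelled out on the slide.

```python
import random
from collections import defaultdict


def skeleton(inchikey):
    """First block of an InChIKey (the connectivity hash)."""
    return inchikey.split("-")[0]


def stratified_split(records, external_keys, train_frac=0.75, seed=1):
    """Split (inchikey, outcome) records into train and test lists.

    Chemicals whose skeleton appears in external_keys (for example the
    ENTACT validation set) are removed before splitting.
    """
    external = {skeleton(k) for k in external_keys}
    by_outcome = defaultdict(list)
    seen = set()
    for key, outcome in records:
        sk = skeleton(key)
        if sk in external or sk in seen:
            continue  # held out for external validation, or a duplicate
        seen.add(sk)
        by_outcome[outcome].append(sk)

    rng = random.Random(seed)
    train, test = [], []
    # Splitting each outcome class separately keeps class proportions
    # (approximately) the same in the training and test sets.
    for outcome, keys in by_outcome.items():
        rng.shuffle(keys)
        cut = round(train_frac * len(keys))
        train += [(k, outcome) for k in keys[:cut]]
        test += [(k, outcome) for k in keys[cut:]]
    return train, test
```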

-------
Learning approach
•	Four models
-GC (derivatized), GC (not derivatized)
-ESI+ LC, ESI- LC
•	Random forest (will explain)
-Downsample absence data to match count of presence data
-Optimize mtry and ntree via grid search
-5-fold cross validation
-Y-randomization
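Three of the steps listed above (downsampling, 5-fold cross validation, and Y-randomization) are simple data manipulations and can be sketched without any modeling library. The helpers below are hypothetical, standard-library Python written for illustration, not the study's R workflow. The downsampler trims whichever class is larger, which is a generalization of the slide's "downsample absence data" and is an assumption.

```python
import random
from collections import Counter


def downsample(labels, seed=0):
    """Return sorted row indices with the larger class randomly trimmed
    so that every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    target = min(Counter(labels).values())
    keep = []
    for cls in set(labels):
        rows = [i for i, lab in enumerate(labels) if lab == cls]
        keep += rng.sample(rows, target)
    return sorted(keep)


def k_fold_indices(n, k=5, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def y_randomize(labels, seed=0):
    """Shuffled copy of the labels. A model trained on these should do
    no better than chance; if it does, the real model's score is suspect."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

In each cross-validation round, one fold serves as the held-out set and the other four are used for fitting; the grid search over mtry and ntree would repeat this for every parameter pair.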
[Figure: "Random Forest Simplified". An instance is passed to Tree-1, Tree-2, ...; each tree votes Class-A or Class-B; majority voting gives the final class.]

-------
Choosing the correct descriptors to predict the endpoint
Random Forest Algorithm
Training set $X = x_1, x_2, \ldots, x_n$ with responses $Y = y_1, y_2, \ldots, y_n$
For $b = 1, \ldots, B$:
1.	Sample, with replacement, $n$ training examples from $X$, $Y$; call these $X_b$, $Y_b$.
2.	Train a classification tree $f_b$ on $X_b$, $Y_b$.
3.	The majority vote of all $f_b$ classifies unseen samples.
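The bootstrap-and-vote loop above can be made concrete with a toy implementation. The sketch below is standard-library Python for illustration only; the study used R's randomForest. To stay short, each "tree" is a single-split stump, which is a simplification and an assumption. `n_trees` and `mtry` loosely mirror the ntree and mtry parameters named earlier, except that real random forests resample features at every split rather than once per tree.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Fit a one-split tree: choose the (feature, threshold) pair with
    the fewest misclassified training rows."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_lab = Counter(left).most_common(1)[0][0]
            right_lab = Counter(right).most_common(1)[0][0] if right else left_lab
            errors = (sum(lab != left_lab for lab in left)
                      + sum(lab != right_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_lab, right_lab)
    _, j, t, left_lab, right_lab = best
    return lambda row: left_lab if row[j] <= t else right_lab


def train_forest(X, y, n_trees=25, mtry=1, seed=0):
    """Steps 1 and 2: bootstrap the rows, then fit one stump per sample."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        features = rng.sample(range(len(X[0])), mtry)  # random feature subset
        forest.append(train_stump(Xb, yb, features))
    return forest


def predict(forest, row):
    """Step 3: majority vote over all trees."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]
```

Because every tree sees a different bootstrap sample and feature subset, individual trees disagree on borderline cases, and the vote averages that noise out.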
[Figure: example decision tree with split nodes "Creamer?" and "Artwork?"]

-------
•	Classification models need negative data, in addition to positive data
-labs do not report chemicals NOT seen, only those identified by the instrument
•	How do we provide a model with negative data?
-produce the negative data ourselves (but note it is expensive)
-assume all chemicals not present are absent
-make assumption(s) as to what WAS tested
•	For now, let's assume that if a chemical is detected in either ESI+ or ESI-, then it has also been tested in the other mode
•	Still exploring reasonable assumptions for GCMS
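The ESI assumption above amounts to a small set operation: the chemicals tested in both modes are taken to be the union of those detected in either mode, and a chemical missing from one mode's detections becomes a negative for that mode. The sketch below is illustrative standard-library Python; the function name and the chemical identifiers are hypothetical.

```python
def label_esi(detected_pos, detected_neg):
    """Build presence/absence labels for both ESI modes.

    Assumption from the slide: a chemical detected in either mode was
    also tested in the other, so non-detection there counts as absence.
    """
    tested = set(detected_pos) | set(detected_neg)
    return {
        chem: {"ESI+": chem in detected_pos, "ESI-": chem in detected_neg}
        for chem in sorted(tested)
    }


# Hypothetical identifiers, for illustration only.
labels = label_esi({"chem_A", "chem_B"}, {"chem_B", "chem_C"})
```

Here `chem_A` becomes a negative example for ESI- and `chem_C` a negative example for ESI+. Chemicals detected in neither mode never enter the dataset, which is why the resulting negatives may still contain false negatives.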

-------
Descriptor importance
Important descriptor descriptions
•	MDEO-11
-molecular distance edge between all primary oxygens
•	MLFER-A
-overall or summation solute hydrogen bond acidity
•	SHsOH & maxHsOH
-electrotopological state with respect to -OH fragments
•	nN
-the number of N atoms...
[Figure: two descriptor importance plots. Descriptor labels are listed below in extraction order.]
Panel 1: MDEO.11, SHsOH, MLFER_A, maxHsOH, SHBd, maxHBd, nAcid, nsOH, minHsOH, minsOH, ATSC1e, SHBint2, maxHBint2, nN, minaasC, minHBint2, GATS2c, minHBd, maxsssN, minssCH2, GATS1s, nHBint2, minaaCH, ATSC4i, maxHBint4, SHBint4, minHBint4, SssNH, ATSC3i, ATSC2p
Panel 2: nN, SHsOH, maxHsOH, MDEO.11, nAtomLAC, nAcid, nBase, AATSC0i, minsOH, MLFER_A, ATSC2v, nAtomLC, nsOH, nssCH2, hmax, MDEO.12, ATSC2p, ATSC1e, ATSC1m, SssNH, nRotBt, minHsOH, Kier2, SsNH2, maxHBd, ATSC1i, minaasC, SaasC, minaaCH, AATS3e

-------
Model results
ESI+ Not Downsampled
                  Reference
Prediction    Present    Absent
Present          2409       252
Absent            291       716
Sensitivity: 0.8922
Specificity: 0.7397
Balanced Accuracy: 0.8159
ESI+ Downsampled
                  Reference
Prediction    Present    Absent
Present          2273       388
Absent            171       836
Sensitivity: 0.9300
Specificity: 0.6830
Balanced Accuracy: 0.8065
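The three summary metrics follow directly from the four cells of each confusion matrix, with "Present" treated as the positive class. The sketch below is illustrative standard-library Python with a hypothetical function name; it reproduces the values reported for the ESI+ model without downsampling.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Metrics from a 2x2 confusion matrix (rows: prediction, columns: reference).

    tp: predicted present, reference present
    fp: predicted present, reference absent
    fn: predicted absent,  reference present
    tn: predicted absent,  reference absent
    """
    sensitivity = tp / (tp + fn)  # fraction of reference-present found
    specificity = tn / (tn + fp)  # fraction of reference-absent found
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# ESI+ model, not downsampled (cell values taken from the slide).
esi_pos = confusion_metrics(tp=2409, fp=252, fn=291, tn=716)
```

Balanced accuracy weights both classes equally, so it is not inflated when one class dominates the dataset.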

-------
Model results
ESI- Not Downsampled
                  Reference
Prediction    Present    Absent
Present          1659       305
Absent            291      1413
Sensitivity: 0.8508
Specificity: 0.8225
Balanced Accuracy: 0.8366
ESI- Downsampled
                  Reference
Prediction    Present    Absent
Present          1649       315
Absent            271      1433
Sensitivity: 0.8589
Specificity: 0.8198
Balanced Accuracy: 0.8393

-------
Internal test set results
ESI+ Not Downsampled
                  Reference
Prediction    Present    Absent
Present           804       114
Absent             84       220
Sensitivity: 0.9054
Specificity: 0.6587
Balanced Accuracy: 0.782
ESI+ Downsampled
                  Reference
Prediction    Present    Absent
Present           767        65
Absent            121       269
Sensitivity: 0.8637
Specificity: 0.8054
Balanced Accuracy: 0.8346

-------
Internal test set results
ESI- Not Downsampled
                  Reference
Prediction    Present    Absent
Present           551       104
Absent            115       452
Sensitivity: 0.8273
Specificity: 0.8129
Balanced Accuracy: 0.8201
ESI- Downsampled
                  Reference
Prediction    Present    Absent
Present           545        92
Absent            121       464
Sensitivity: 0.8183
Specificity: 0.8345
Balanced Accuracy: 0.8264

-------
Current & future work
Comparing model results to ENTACT results
-comparing predictions against independent labs, consensus of labs
Considering new metrics for model quality
-balanced accuracy not ideal when negative data may contain false negatives
Applicability domains for models under development
-global and local measures
Working with collaborators to improve available data
-data from additional potential collaborators would be GREATLY appreciated

-------
Contributing researchers
Credit: the Research Triangle Foundation
EPA ORD
Hussein Al-Ghoul*
Alex Chao*
Louis Groff*
Jarod Grossman*
Chris Grulke
Kristin Isaacs
Sarah Laughlin*
Jon Sobus
Kamel Mansouri*
James McCord
Andrew McEachran*
Jeff Minucci
Seth Newton
Katherine Phillips
EPA ORD (cont.)
Tom Purucker
Ann Richard
Randolph Singh*
Mark Strynar
Elin Ulrich
John Wambaugh
Antony Williams
GDIT
Ilya Balabin
Tom Transue
Tommy Cathey
* = ORISE/ORAU

-------
Thank you for listening!
