EPA
United States
Environmental Protection
Agency
Building a Non-Targeted Analysis Research
Program at the U.S. EPA
Jon R. Sobus, Ph.D. & th
Center for Computational Toxicology and Exposure
Research Triangle
Office of Research and Development
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
DB & Library Matching
Data Analysis
-------
EPA's DSSTox database: History of development of a curated chemistry
resource supporting computational toxicology research
Christopher M. Grulke (a), Antony J. Williams (a), Inthirany Thillanadarajah (b), Ann M. Richard (a)
(a) National Center for Computational Toxicology, Office of Research & Development, US Environmental Protection Agency, Mail Drop D143-02, Research Triangle Park, NC 27711, USA
(b) Senior Environmental Employment Program, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
Chemical Database = DSSTox
Computational Toxicology 12 (2019) 100096
[Figure: DSSTox curation workflow. Sources ordered by environmental/toxicity/exposure relevance (EPA-relevant lists, EPA ACToR, ChemID, PubChem) feed a curation queue. DSSTox V1 holds manually curated 1:1:1 CAS RN-name-structure mappings (conflicts detected, not loaded); autoloads allow no conflicts. Record counts shown on the figure: ~742K, ~160K, ~56K, ~24K; DSSTox >300K: ~582K, ~104K, ~32K, ~19K, ~5K. QC levels: DSSTox High, DSSTox Low, Public High, Public Med, Public Low, Public Untrusted.]
-------
MS-Ready Structures
Spiked Substance: Tamoxifen (DTXSID1034187)
Spiked Substance: Tamoxifen citrate (DTXSID8021301)
MS-Ready Processing: both substances map to the same MS-ready structure, DTXCID9014187
Predicted Formula for Observed Molecular Feature: C26H29NO
Dashboard Search results: 1st: DTXSID1034187; 2nd: DTXSID8021301
-------
Dashboard Access
comptox.epa.gov/dashboard
[Screenshot: CompTox Chemicals Dashboard home page. Navigation: Home, Advanced Search, Batch Search, Lists, Predictions, Downloads. Search tabs: Chemicals, Product/Use Categories, Assay/Gene. Search box covering 875 thousand chemicals by systematic name, synonym, or identifiers such as DTXSID or InChIKey, with an identifier substring search option.]
DSSTox MS Ready Mapping File (Posted: 11/14/2016)
The CompTox Chemistry Dashboard can be used by mass spectrometrists for the purpose of structure identification. A normal formula search would search the exact formula associated with any chemical, whether it includes solvents of hydration, salts or multiple components. However, mass spectrometry detects ionized chemical structures, and molecular formula searches should be based on desalted and desolvated structures with stereochemistry removed. We refer to these as "MS ready structures". The MS-ready mappings are delivered as Excel spreadsheets containing the Preferred Name, CAS-RN, DTXSID, Formula, Formula of the MS-ready structure and associated masses, SMILES and InChI Strings/Keys. (UPDATED APRIL 2019)
McEachran et al. J Cheminform (2018) 10:45
https://doi.org/10.1186/s13321-018-0299-2
Journal of Cheminformatics (Methodology, Open Access)
"MS-Ready" structures for non-targeted high-resolution mass spectrometry screening studies
Andrew D. McEachran, Kamel Mansouri, Chris Grulke, Emma L. Schymanski, Christoph Ruttkies and Antony J. Williams
Nicarbazin (DTXSID6034762)
Formula: C19H18N6O6, i.e. C13H10N4O5/C6H8N2O
MS-Ready Forms:
DTXCID8023761: C13H10N4O5
DTXCID50209864: C6H8N2O
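The link between an MS-ready formula and the monoisotopic mass used for searching can be illustrated with a minimal sketch. It only handles simple formula strings and splits multi-component formulae on "/"; the actual MS-ready workflow operates on structures (desalting, desolvation, stereochemistry removal), which is not reproduced here. The function names and the small element table are illustrative assumptions, not part of the Dashboard.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
    "Cl": 34.9688527,
}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C13H10N4O5'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())

def component_formulae(substance_formula):
    """Split a multi-component formula written as 'A/B' into its components."""
    return substance_formula.split("/")

# Nicarbazin is listed as two components; each gets its own searchable mass.
components = component_formulae("C13H10N4O5/C6H8N2O")
masses = {f: round(monoisotopic_mass(f), 4) for f in components}
```

Each component is searched on its own mass, which is why a two-component substance such as Nicarbazin maps to two MS-ready forms.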
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
DB & Library Matching
Data Analysis
-------
EPA NTA WebApp
Feature Removal:
1) Duplicate features
2) Non-reproducible features
3) Blank features (sample:blank)
4) Non-responsive features (dilutions)
Feature Flagging:
1) Multi-mode hits (+ and -)
2) Meas. precision (CV threshold)
3) Formula match (score > threshold)
4) Negative mass defect
5) Halogenation
6) Has/is adduct
7) Has/is neutral loss
8) Has/is multimer
Dashboard Integration:
1) Data source & pub counts
2) Bioactivity & exposure levels
3) Presence on lists
4) Product & use categories
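A few of the removal and flagging rules listed above can be sketched in a handful of lines. This is a hypothetical illustration, not the WebApp's code: the function names and the default thresholds (a sample:blank ratio of 3 and a maximum CV of 0.8) are assumptions chosen only for the example.

```python
from statistics import mean, stdev

def replicate_cv(abundances):
    """Coefficient of variation (sample std dev / mean) across replicates."""
    return stdev(abundances) / mean(abundances)

def keep_feature(sample_reps, blank_reps, min_sample_blank=3.0, max_cv=0.8):
    """Apply two removal rules: blank-level and non-reproducible features.

    Thresholds are hypothetical defaults, not values from the WebApp.
    """
    blank_mean = mean(blank_reps)
    ratio = float("inf") if blank_mean == 0 else mean(sample_reps) / blank_mean
    return ratio >= min_sample_blank and replicate_cv(sample_reps) <= max_cv

def has_negative_mass_defect(mass):
    """Flag a feature whose mass falls just below the nearest integer.

    Simplification: treats the nearest integer as the nominal mass.
    """
    return mass - round(mass) < 0

# Toy abundances for one feature measured in triplicate.
kept = keep_feature([100, 110, 90], [10, 12, 8])
dropped = keep_feature([100, 110, 90], [90, 100, 110])
```

Removal rules drop a feature outright, while flags such as the negative mass defect only annotate it for later review.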
[Screenshot: "NTA: non-targeted analysis of MS data (beta)" web form, served locally at http://127.0.0.1:8000/nta/. Input fields: positive MPP file (csv), negative MPP file (csv), adduct mass accuracy units, adduct mass accuracy, adduct retention time accuracy (min), tracer file (csv; optional), tracer mass accuracy units, tracer retention time accuracy (min), min sample:blank cutoff, max replicate CV, parent ion mass accuracy (ppm), search dashboard by, save top result only?, save metadata? Some field labels are illegible in the source.]
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
  Sample Extracts -> LC/Q-TOF HRMS -> MS1 Acquisition and MS2 Acquisition
  MS2 .d Files
  MS2 .mgf Files
DB & Library Matching
  Chemical Database -> DB MS-Ready Structures -> DB MS-Ready Formula & Monoisotopic Mass
  Reference MS2 Spectra
  in silico MS2 Spectra
Data Analysis
  Filtered Feature Table
  Chemical Candidate Table
  Aggregated Match Table
  MS2 Reference Matches
  MS2 in silico Matches
-------
Generation of in silico Spectra
CFM-ID v2.0
"Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification"
Felicity Allen, Russ Greiner, David Wishart
"Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns"
Andrew D. McEachran, Ilya Balabin, Tommy Cathey, Thomas R. Transue, Hussein Al-Ghoul, Chris Grulke, Jon R. Sobus & Antony J. Williams
Training Set: Metlin MS2 spectra and structures -> Machine Learning Fragmentation Prediction Model
DSSTox MS-Ready Structures (~765,000) -> Fragmentation Prediction Model -> DSSTox MS2 spectra (10, 20, 40 V)
McEachran, Andrew D., et al. Scientific Data 6.1 (2019): 1-9.
Allen, Felicity, et al. Metabolomics 11.1 (2015): 98-110.
-------
CFM-ID Database Matching
"In silico MS/MS spectra for identifying unknowns: a critical examination using CFM-ID algorithms and ENTACT mixture samples"
Alex Chao, Hussein Al-Ghoul, Andrew D. McEachran, Ilya Balabin, Tom Transue, Tommy Cathey, Jarod N. Grossman, Randolph R. Singh, Elin M. Ulrich, Antony J. Williams, Jon R. Sobus
Analytical and Bioanalytical Chemistry (2020) 412:1301-1315
https://doi.org/10.1007/s00216-019-02351-7

1. Query database by mass
MGF file -> Retrieve candidate compounds within mass window
Candidate 1: C19H20N2O3S (Mass = 356.119)
Candidate 2: C19H20N2O3S (Mass = 356.119)
Candidate 3: C21H21ClO3 (Mass = 356.118)
Each candidate has in silico MS2 spectra (CE 10, 20, 40).

2. Score in silico spectra
CFM-ID Scores
             in silico CE 10   in silico CE 20   in silico CE 40
Candidate 1  0.5               0.3               0.1
Candidate 2  0.2               0.1               0.02
Candidate 3  0.1               0.05              0.01
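Step 1, retrieving candidates within a mass window, might look like the following minimal sketch. The 10 ppm tolerance, the record layout, and the extra "Unrelated" entry are assumptions made for illustration; the candidate formulae and masses are the ones shown on the slide.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def candidates_in_window(query_mass, database, tol_ppm=10.0):
    """Return database records whose mass is within tol_ppm of the query mass."""
    return [
        rec for rec in database
        if abs(ppm_error(query_mass, rec["mass"])) <= tol_ppm
    ]

# Toy database: three slide candidates plus one hypothetical unrelated record.
database = [
    {"name": "Candidate 1", "formula": "C19H20N2O3S", "mass": 356.119},
    {"name": "Candidate 2", "formula": "C19H20N2O3S", "mass": 356.119},
    {"name": "Candidate 3", "formula": "C21H21ClO3", "mass": 356.118},
    {"name": "Unrelated", "formula": "C26H29NO", "mass": 371.225},
]

hits = candidates_in_window(356.119, database)
hit_names = [rec["name"] for rec in hits]
```

Candidate 3 differs from the query by roughly 3 ppm, so a mass-only query cannot separate it from the two isomeric candidates.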
-------
CFM-ID Database Matching (w/ Formula Information)
1. Query database by mass
MGF file -> Exp MS2 Spectrum (Mass = 356.119) -> Retrieve candidate compounds within mass window
Formula Identified -> Filter candidates by formula
Candidate 1: C19H20N2O3S (Mass = 356.119)
Candidate 2: C19H20N2O3S (Mass = 356.119)
Candidate 3: C21H21ClO3 (Mass = 356.118), struck out on the slide (removed by the formula filter)
Each candidate has in silico MS2 spectra (CE 10, 20, 40).

2. Score in silico spectra
CFM-ID Scores
             in silico CE 10   in silico CE 20   in silico CE 40
Candidate 1  0.5               0.3               0.1
Candidate 2  0.2               0.1               0.02
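The formula filter can be sketched as a simple equality check applied before scoring. The slide does not state which formula was identified; the sketch assumes C19H20N2O3S because Candidates 1 and 2 are the ones that remain. Ranking by the sum of the three per-CE scores is also an assumption used only to order the output.

```python
def filter_by_formula(candidates, identified_formula):
    """Keep only candidates whose formula matches the identified formula."""
    return [c for c in candidates if c["formula"] == identified_formula]

def rank_by_total_score(candidates):
    """Order candidates by the sum of their per-CE scores, best first."""
    return sorted(candidates, key=lambda c: sum(c["scores"]), reverse=True)

# Candidates and CFM-ID scores (CE 10, 20, 40) as shown on the slides.
candidates = [
    {"name": "Candidate 2", "formula": "C19H20N2O3S", "scores": [0.2, 0.1, 0.02]},
    {"name": "Candidate 3", "formula": "C21H21ClO3", "scores": [0.1, 0.05, 0.01]},
    {"name": "Candidate 1", "formula": "C19H20N2O3S", "scores": [0.5, 0.3, 0.1]},
]

# Assumed identified formula (not stated explicitly on the slide).
remaining = filter_by_formula(candidates, "C19H20N2O3S")
ranked_names = [c["name"] for c in rank_by_total_score(remaining)]
```

Filtering first shrinks the candidate list, so fewer in silico spectra need to be scored and fewer false candidates can outrank the true compound.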
-------
CFM-ID Database Matching (w/ Multiple CE_experimental)
1. Query database by mass
Exp MS2 CE10 Spectrum (Mass = 356.119)
Exp MS2 CE20 Spectrum (Mass = 356.119)
Exp MS2 CE40 Spectrum (Mass = 356.119)
Retrieve candidate compounds within mass window
Candidate 1: C19H20N2O3S (Mass = 356.119)
Candidate 2: C19H20N2O3S (Mass = 356.119)
Candidate 3: C21H21ClO3 (Mass = 356.118)
Each candidate has in silico MS2 spectra (CE 10, 20, 40).

2. Score in silico spectra
CFM-ID Scores
                          in silico CE 10   in silico CE 20   in silico CE 40
Candidate 1, CEexp = 10   0.5               0.3               0.1
Candidate 1, CEexp = 20   0.4               0.5               0.12
Candidate 1, CEexp = 40   0.05              0.1               0.2
-------
CFM-ID Scoring Approaches
Precursor 1, Candidate 1
Approach 1: Experimental spectrum at CE 10 vs. predicted spectrum at CE 10.
  Score = A
Approach 2: Experimental spectrum at CE 10 vs. predicted spectra at CE 10, 20, and 40.
  Score = A+B+C
Approach 3: Experimental spectra at CE 10, 20, and 40, each vs. predicted spectra at CE 10, 20, and 40.
  Score (CE 10) = A+B+C
  Score (CE 20) = D+E+F
  Score (CE 40) = G+H+I
  Score = A+B+C+D+E+F+G+H+I
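The three ways of combining scores can be sketched with a 3x3 matrix of pairwise scores, where rows are experimental collision energies and columns are predicted ones. The numbers reuse the Candidate 1 values from the multiple-CE slide; the function names are illustrative assumptions.

```python
# Rows: experimental CE (10, 20, 40); columns: predicted CE (10, 20, 40).
scores = [
    [0.5, 0.3, 0.1],    # A, B, C
    [0.4, 0.5, 0.12],   # D, E, F
    [0.05, 0.1, 0.2],   # G, H, I
]

def single_pair_score(matrix):
    """One experimental spectrum vs. one predicted spectrum: Score = A."""
    return matrix[0][0]

def one_experimental_all_predicted(matrix, exp_row=0):
    """One experimental spectrum vs. all predicted spectra, e.g. A+B+C."""
    return sum(matrix[exp_row])

def all_experimental_all_predicted(matrix):
    """All experimental spectra vs. all predicted spectra: A+B+...+I."""
    return sum(sum(row) for row in matrix)

score_a = single_pair_score(scores)
score_abc = one_experimental_all_predicted(scores)
score_total = all_experimental_all_predicted(scores)
```

Summing over more spectrum pairs raises the raw score, so scores from different approaches are not directly comparable without normalization.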
-------
EPA's Non-Targeted Analysis Collaborative Trial (ENTACT)
The Trial Mixtures:
10 Mixtures ranging from 95 to 365 compounds
(Total: 1,269 unique compounds)
"Pass" compounds = 377 with MS2 data
Agilent 1290 UPLC
Agilent 6530B Q-TOF with ESI source
Ulrich, Elin M., et al. Analytical and bioanalytical chemistry 411.4 (2019): 853-866.
Sobus, Jon R., et al. Analytical and bioanalytical chemistry 411.4 (2019): 835-851.
-------
Reference vs. in silico Library Coverage
[Figure: Venn diagram of "Pass" compounds (n = 377). PCDL only: 88; PCDL and CFM-ID: 111; CFM-ID only: 77; neither: 101.]

MS2 Library                      % of "Pass" Compounds Identified
Agilent PCDL                     53%
CFM-ID Top Hit                   50%
PCDL and/or CFM-ID Top Hit       73%

PCDL -> Agilent reference MS2 library
"Pass" compounds (n=377) -> ENTACT chemicals observed with MS2 data
-------
NTA Workflows: Using CFM-ID Results as Filters
Score
Filter out candidates
below score cutoff
Variability in score
distribution
Rank
Filter out candidates
above rank cutoff
Variability in number of
candidate compounds
Filter by Top 20
-------
Normalizing CFM-ID Results Values
Score Quotient: normalize score to the highest candidate compound score
Score Percentile: normalize rank to the number of candidate compounds

                      Rank  CFM-ID Score  Maximum Score  Score Quotient  Score Percentile
Candidate Compound 1  1     0.5           0.5            1               100
Candidate Compound 2  2     0.4           0.5            0.8             80
Candidate Compound 3  3     0.39          0.5            0.78            60
Candidate Compound 4  4     0.1           0.5            0.2             40
Candidate Compound 5  5     0.05          0.5            0.1             20

Score Quotient = Score / Maximum Score
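Both normalizations can be reproduced from the five example scores with a short sketch. The quotient follows the formula stated on the slide. The percentile formula, 100 x (N - rank + 1) / N, is an assumption inferred from the table values (100, 80, 60, 40, 20 for five candidates). Ties and an all-zero score list are not handled.

```python
def normalize(scores):
    """Return (rank, score, quotient, percentile) per candidate, best first.

    quotient   = score / maximum score (as stated on the slide)
    percentile = 100 * (N - rank + 1) / N (inferred from the example table)
    """
    ordered = sorted(scores, reverse=True)
    top, n = ordered[0], len(ordered)
    rows = []
    for rank, score in enumerate(ordered, start=1):
        quotient = round(score / top, 2)
        percentile = 100 * (n - rank + 1) / n
        rows.append((rank, score, quotient, percentile))
    return rows

rows = normalize([0.5, 0.4, 0.39, 0.1, 0.05])
quotients = [r[2] for r in rows]
percentiles = [r[3] for r in rows]
```

The quotient depends on how the scores are distributed, while the percentile depends only on rank and candidate count, which is why Candidates 2 and 3 have nearly equal quotients but percentiles 20 points apart.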
-------
NTA Workflows: Using CFM-ID Normalized Results as Filters
Score Quotient
Filter out candidates
below score quotient
cutoff
Score Percentile
Filter out candidates
below percentile cutoff
Score quotient cutoff = 0.5: keep candidates scoring at least half of max score
Score percentile cutoff = 0.5: keep the top 50% of candidates
[Figure: candidate score distributions for MS2 Spectrum 1 and MS2 Spectrum 2 under each cutoff]
-------
Applying Cut-off Filters to Data
                      CFM-ID Score  Maximum Score  Score Quotient
Candidate Compound 1  0.5           0.5            1
Candidate Compound 2  0.4           0.5            0.8
Candidate Compound 3  0.39          0.5            0.78
Candidate Compound 4  0.1           0.5            0.2
Candidate Compound 5  0.05          0.5            0.1
[Figure: candidate compounds plotted along a Score Quotient axis]
-------
Applying Cut-off Filters to Data
0.5 0.5 1
0.4 0.5 0.8
0.39 0.5 0.78
0.1 0.5 0.2
0.05 0.5 0.1
Legend: True Compound; Other Candidate Compounds
True Positives
False Negatives
True Negatives
False Positives
-------
Applying Cut-off Filters to Data
                      CFM-ID Score  Maximum Score  Score Quotient
Candidate Compound 1  0.5           0.5            1
Candidate Compound 2  0.4           0.5            0.8
Candidate Compound 3  0.39          0.5            0.78
Candidate Compound 4  0.1           0.5            0.2
Candidate Compound 5  0.05          0.5            0.1
Legend: True Compound; Other Candidate Compounds
Score Quotient Cut-off = 0
True Positives: 1
False Negatives: 0
True Negatives: 0
False Positives: 4
-------
Applying Cut-off Filters to Data
                      CFM-ID Score  Maximum Score  Score Quotient
Candidate Compound 1  0.5           0.5            1
Candidate Compound 2  0.4           0.5            0.8
Candidate Compound 3  0.39          0.5            0.78
Candidate Compound 4  0.1           0.5            0.2
Candidate Compound 5  0.05          0.5            0.1
Legend: True Compound; Other Candidate Compounds
Score Quotient Cut-off = 0.5
True Positives: 1
False Negatives: 0
True Negatives: 2
False Positives: 2
-------
Applying Cut-off Filters to Data
                      CFM-ID Score  Maximum Score  Score Quotient
Candidate Compound 1  0.5           0.5            1
Candidate Compound 2  0.4           0.5            0.8
Candidate Compound 3  0.39          0.5            0.78
Candidate Compound 4  0.1           0.5            0.2
Candidate Compound 5  0.05          0.5            0.1
Legend: True Compound; Other Candidate Compounds
Score Quotient Cut-off = 0.9
True Positives: 0
False Negatives: 1
True Negatives: 3
False Positives: 1
-------
Balancing Cut-offs
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
How many of the true compounds are we keeping?
How much of the junk are we getting rid of?
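The counts on the cut-off slides and the two rates can be reproduced with a small sketch. The slides do not say which candidate is the true compound; their counts only imply it is Candidate 2 or Candidate 3, so the sketch assumes Candidate 2. The function names are illustrative.

```python
def confusion_counts(quotients, is_true, cutoff):
    """Count (TP, FN, TN, FP) when candidates with quotient >= cutoff are kept."""
    tp = fn = tn = fp = 0
    for quotient, truth in zip(quotients, is_true):
        kept = quotient >= cutoff
        if truth and kept:
            tp += 1
        elif truth:
            fn += 1
        elif kept:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp

def rates(tp, fn, tn, fp):
    """Return (TPR, FPR); a rate is 0.0 when its denominator is empty."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

quotients = [1.0, 0.8, 0.78, 0.2, 0.1]
# Assumption: Candidate Compound 2 is the true compound.
is_true = [False, True, False, False, False]

counts_by_cutoff = {c: confusion_counts(quotients, is_true, c) for c in (0, 0.5, 0.9)}
tpr_05, fpr_05 = rates(*counts_by_cutoff[0.5])
```

Raising the cut-off from 0.5 to 0.9 halves the false positive rate (0.5 to 0.25) but loses the true compound, which is the trade-off the ROC curves summarize.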
-------
Quotient Vs. Percentile Cutoffs
[Figure: Global ROC Curves (All ENTACT Mixtures). x-axis: False Positive Rate (0.0 to 1.0); y-axis: True Positive Rate (0.0 to 1.0). Curves: Quotient (by formula), Percentile (by formula), Quotient (by mass), Percentile (by mass).]
-------
Quotient Vs. Percentile Cutoffs
[Figure: Global ROC Curves (All ENTACT Mixtures), with curves for Quotient (by formula), Percentile (by formula), Quotient (by mass), and Percentile (by mass), shown next to a panel of candidate scores marking the Score Quotient Cut-off, True Compounds, and Other Candidate Compounds.]
-------
Quotient Vs. Percentile Cutoffs
[Figure: Global ROC Curves (All ENTACT Mixtures). x-axis: False Positive Rate; y-axis: True Positive Rate. Curves: Quotient (by formula), Percentile (by formula), Quotient (by mass), Percentile (by mass).]

Cut-off Values for Global TPR = 0.9
                           Cut-off value
Quotient (by formula)      0.18
Percentile (by formula)    38
Quotient (by mass)         0.13
Percentile (by mass)       32

Apply to individual ENTACT mixtures
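One way to pick a cut-off for a target TPR is to sort the quotients of the known true compounds and take the value that still retains the required fraction. This is a minimal sketch under that assumption; the study may have derived its cut-offs differently, and the quotient values below are invented toy numbers, not ENTACT results.

```python
import math

def highest_cutoff_for_tpr(true_quotients, target_tpr=0.9):
    """Largest cut-off that keeps at least target_tpr of the true compounds.

    A true compound is kept when its quotient is >= the cut-off.
    """
    ordered = sorted(true_quotients, reverse=True)
    # Small epsilon guards against floating-point error in the product.
    needed = math.ceil(target_tpr * len(ordered) - 1e-9)
    return ordered[needed - 1]

def true_positive_rate(true_quotients, cutoff):
    """Fraction of true compounds with quotient >= cutoff."""
    kept = sum(1 for q in true_quotients if q >= cutoff)
    return kept / len(true_quotients)

# Hypothetical quotients of ten true compounds (toy data).
toy_true_quotients = [1.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.05]

cutoff = highest_cutoff_for_tpr(toy_true_quotients, target_tpr=0.9)
achieved_tpr = true_positive_rate(toy_true_quotients, cutoff)
```

A cut-off chosen this way on pooled data will not give exactly the same TPR on each individual mixture, which motivates checking it mixture by mixture.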
-------
CFM-ID Cut-off Filtering: Individual ENTACT Mixtures
[Figure: cut-off filtering results for individual ENTACT mixtures; legend: Max, Min]
-------
Experimental Acquisition
  Sample Extracts -> LC-QTOF/MS -> MS1 Acquisition and MS2 Acquisition
  MS2 Acquisition .d Files
  MS2 Exported .mgf Files
Database/Library Matching
  DSSTox Database (~875,000 Substances) -> DSSTox MS-Ready Structures -> DSSTox MS-Ready Formulae
  Agilent PCDL.csv (DSSTox MS-Ready Formulae)
  Agilent PCDL (Reference MS2 Spectra)
  CFM-ID Database (in silico MS2 Spectra)
Data Analysis
  MS1 Feature Table
  Filtered Feature Table
  Chemical Candidate Table (Dashboard MetaData)
  Aggregated Match Table
  PCDL Results Table (Manually Reviewed)
  CFM-ID Results Table (Percentiles & Quotients)
-------
Contributing Researchers
This work was supported, in part, by ORD's Pathfinder Innovation Program (PIP) and an ORD EMVL award
EPA ORD
Hussein Al-Ghoul*
Alex Chao*
Jarod Grossman*
Kristin Isaacs
Sarah Laughlin*
Charles Lowe
James McCord
Jeff Minucci
Seth Newton
Katherine Phillips
Tom Purucker
Randolph Singh*
Mark Strynar
Elin Ulrich
EPA ORD (cont.)
Chris Grulke
Kamel Mansouri*
Andrew McEachran*
Ann Richard
John Wambaugh
Antony Williams
Agilent
Jarod Grossman
Andrew McEachran
GDIT
Ilya Balabin
Tom Transue
Tommy Cathey
* = ORISE/ORAU
-------
Questions?
sobus.jon@epa.gov
The views expressed in this presentation are
those of the authors and do not necessarily
represent the views or policies of the U.S.
Environmental Protection Agency.
-------