EPA
United States
Environmental Protection
Agency
Building a Non-Targeted Analysis Research
Program at the U.S. EPA
Jon R. Sobus, Ph.D. & th
Center for Computational Toxicology and Exposure
Research Triangle
Office of Research and Development
-------
Jon 'Nature Boy' Sobus
Alex 'Can Do' Chao
Mark 'Blue Steel' Strynar
James 'Shake-n-Bake' McCord
Seth 'Nice guy' Newton
Hannah 'Dr. Cool' Liberatore
The Unflappable Ariel Wallace
Nelson 'Prints' Yeung
Tom 'Mystery Man' Purucker
Adventurin' Jeff Minucci
Current NTA Team
Elin 'Da Boss' Ulrich
-------
Key Drivers for 21st Century Exposure Science
1) Understanding causes of disease
"...70-90% of disease risks are probably due to differences in environments"
EPIDEMIOLOGY
Environment and Disease Risks
Stephen M. Rappaport and Martyn T. Smith
Although the risks of developing chronic diseases are attributed to both genetic and environmental factors, 70 to 90% of disease risks are probably due to differences in environments (1-3). Yet epidemiologists increasingly use genome-wide association studies (GWAS) to investigate diseases, while relying on questionnaires to characterize "environmental exposures." This is because GWAS represent the only approach for exploring the totality of any risk factor (genes, in this case) associated with disease prevalence. Moreover, the value of costly genetic information is diminished when inaccurate and imprecise environmental data lead to biased inferences regarding gene-environment interactions (4). A more comprehensive and quantitative view of environmental exposure is needed if epidemiologists are to discover the major causes of chronic diseases.
An obstacle to identifying the most important environmental exposures is the fragmentation of epidemiological research along lines defined by different factors. When epidemiologists investigate environmental risks, they tend to concentrate on a particular category of exposures involving air and water pollution, occupation, diet and obesity, stress and behavior, or types of infection. This slicing of the disease pie along parochial lines leads to scientific separation and confuses the definition of "environmental exposures." In fact, all of these exposure categories can contribute to chronic diseases and should be investigated collectively rather than separately.
To develop a more cohesive view of environmental exposure, it is important to recognize that toxic effects are mediated through chemicals that alter critical molecules, cells, and physiological processes inside the body. Thus, it would be reasonable to consider the "environment" as the body's internal chemical environment and "exposures" as the amounts of biologically active chemicals in this internal environment. Under this view, exposures are not restricted to chemicals (toxicants) entering the body from air, water, or food for example, but also include chemicals produced by inflammation, oxidative stress, lipid peroxidation, infections, gut flora, and other natural processes (5, 6) (see the figure). This internal chemical environment continually fluctuates during life due to changes in external and internal sources, aging, infections, life-style, stress, psychosocial factors, and preexisting diseases.
The term "exposome" refers to the totality of environmental exposures from conception onwards, and has been proposed to be a
Pull quote: A new paradigm is needed to assess how a lifetime of exposure to environmental factors affects the risk of developing chronic diseases.
School of Public Health, University of California, Berkeley, CA 94720-7356, USA. E-mail: srappaport@berkeley.edu
Science, Vol. 330, 22 October 2010, p. 460. www.sciencemag.org
Published by AAAS
2) Ensuring chemical safety
NewScientist cover: "We've made 150,000 new chemicals. We touch them, we wear them, we eat them. But which ones should we worry about?" (Special report, page 34)
-------
High-Throughput Risk Characterization
Many industrial & commercial chemicals are covered by the
Toxic Substances Control Act (TSCA), which is
administered by EPA.
TSCA updated in June 2016 to allow
evaluation of existing and new chemicals.
Characterization of risk requires exposure and hazard data.
EPA's Office of Research and Development (ORD) is
developing new approach methodologies (NAMs) for rapid
risk characterization.
NTA is a promising NAM, but requires careful evaluation
and implementation
'70,000 Chemicals on the TSCA
Inventory
Risk-Based
Prioritization
-------
NTA Research Produces Critical Data
Top-Down Exposomics via NTA
Measure Important Exposures
Within the Receptor
Editorial
Complementing the Genome with an "Exposome":
The Outstanding Challenge of Environmental
Exposure Measurement in Molecular Epidemiology
Christopher Paul Wild
Molecular Epidemiology Unit, Centre for Epidemiology and Biostatistics, Leeds Institute of Genetics, Health and Therapeutics, Faculty of Medicine and Health, University of Leeds, Leeds, United Kingdom
The sequencing and mapping of the human genome provides a foundation for the elucidation of gene expression and protein function, and the identification of the biochemical pathways implicated in the natural history of chronic diseases, including cancer, diabetes, and vascular and neurodegenerative diseases. This knowledge may consequently offer opportunities for a more effective treatment and improved patient manage
UK Biobank will recruit half a million people at a cost of around £60 million ($110 million) in the initial phase. The proposal to establish a "Last Cohort" of 1 million people in the United States (7) or a similar-sized Asian cohort (8) would presumably exceed this sum. In each case, the high cost is heavily influenced by the collection and banking of biological material. This expense is predicated on the assumption that
All "...life-course
environmental exposures
(including lifestyle
factors) from the prenatal
period onwards..."
in those common chronic diseases mentioned above, which constitute the major health burden in economically developed countries (3, 4). Despite this, many exposure-disease associations remain ill defined and the complex interplay with genetic susceptibility is only beginning to be addressed. This raises the question as to whether fundamental knowledge about genetics will improve understanding of disease etiology at the population level.
The new generation of mega-cohort studies, including the UK Biobank or similar proposed US and Asian cohorts (5-8), provides the framework for such investigations of genetic variation, environment, lifestyle, and chronic disease. At the same time, they represent substantial investment. For example,
case-control study design. For laboratories involved in molecular cancer epidemiology, gene-disease association studies offered rapid gains in research output. The literature is now replete with meta-analyses of these data. The studies that have been conducted have, by some accounts, yielded only a modicum of success with relatively few reproducible findings (see for example ref. 12). More recently, improvements in study design have been suggested, notably by increasing subject numbers and by analyzing multiple polymorphisms of functional relevance (13). A more comprehensive coverage of the genome and the possibility to examine the interplay between single nucleotide polymorphisms are now feasible through the application of microarray technology (14). It is predictable that as costs decrease, there will emerge analyses of existing studies on a grander scale. The consequence may not be greater clarity but a greater number of chance findings and an increasing difficulty of dealing with the sheer volume of data in the absence of parallel advances in data analysis. Things may get worse before they get better.
Cancer Epidemiol Biomarkers Prev 2005; 14(8). August 2005
Bottom-Up Exposomics via NTA
Measure Important Exposures
in All Relevant Media
Figure adapted from: Rappaport SM. J Expo Sci Environ Epidemiol. 2011 Jan-Feb;21(1):5-9.
[Figure: exposure pathways. Labels: Inhalation; Wildlife Ingestion; Deposition to Ground; Direct Exposure; Animal Product Ingestion; Irrigation; Crop Ingestion; Shoreline Exposure; Aquatic Food Ingestion; Drinking Water Ingestion]
-------
Our HRMS Tools of the Trade
Thermo GC/Q Exactive Hybrid Quad-Orbitrap
Thermo LC/Orbitrap Fusion Tribrid
Agilent 7250 GC/Q-TOF
Agilent 6546 LC/Q-TOF
Agilent 6530B LC/Q-TOF
-------
NTA Applications at EPA
Exposure surveillance
• What chemicals are in water, products, dust, blood, etc.?
Chemical prioritization
• What are relevant chemicals & mixtures?
Exposure forensics
• What are chemical signatures of exposure sources?
Biomarker discovery
• What chemicals are associated with health impairment?
-------
Exposure Surveillance for Consumer Products
Cite This: Environ. Sci. Technol. 2018, 52, 3125-3135
pubs.acs.org/est
Suspect Screening Analysis of Chemicals in Consumer Products
Katherine A. Phillips, Alice Yau, Kristin A. Favela, Kristin K. Isaacs, Andrew McEachran, Christopher Grulke, Ann M. Richard, Antony J. Williams, Jon R. Sobus, Russell S. Thomas, and John F. Wambaugh*
National Exposure Research Laboratory, Office of Research and Development, U.S. Environmental
Alexander Drive, Research Triangle Park, North Carolina 27711, United States
Southwest Research Institute, San Antonio, Texas 78238, United States
Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, Tennessee 37830, United Stal
National Center for Computational Toxicology, Office of Research and Development, U.S. Environr
T. W. Alexander Drive, Research Triangle Park, North Carolina 27711, United States
19% of chemicals
identified by NTA are on
consumer product
chemical lists
Articles
Confirmed or tentative
chemical identification
Previously known chemical
related to consumer products
Formulations
Foods
Carpet
Cotton Clothing
Carpet Padding
Fabric Upholstery
Shower Curtain
Vinyl Upholstery
Plastic Children's Toy
Lipstick
Toothpaste
Sunscreen
Indoor House Paint
Shaving Cream
Hand Soap
Skin Lotion
Baby Soap
Deodorant
Shampoo
Glass Cleaner
Air Freshener
Cereal
[Figure: chemicals identified per product type; axes: Unique Chemicals (0 to 250) and log10(µg/g)]
Chemical Prioritization for Drinking Water
Environmental Pollution 234 (2018) 297-306
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/envpol
Suspect screening and non-targeted analysis of drinking water using point-of-use filters
Seth R. Newton, Rebecca L. McMahen, Jon R. Sobus, Kamel Mansouri, Antony J. Williams, Andrew D. McEachran, Mark J. Strynar
a United States Environmental Protection Agency, National Exposure Research Laboratory, Research Triangle Park, NC 27709, United States
b Oak Ridge Institute for Science and Education Research Participant, 109 T.W. Alexander Drive, Research Triangle Park, NC 27709, United States
c United States Environmental Protection Agency, National Center for Computational Toxicology, Research Triangle Park, NC 27709, United States
Top 20 Priority Compounds
#    Compound                                       ToxPi Score
1    1,2-Benzisothiazolin-3-one*                    2.99
2    Diethylene glycol                              2.38
3    N-[3-(Dimethylamino)propyl]methacrylamide      2.32
4    Nonylparaben                                   2.22
5    Dipentyl phthalate                             1.89
6    2-[2-(2-Butoxyethoxy)ethoxy]ethanol            1.85
7    N,N-Dimethyldodecan-1-amine*                   1.81
8    Sucralose                                      1.80
9    PFOS*                                          1.79
10   2-(2-Ethoxyethoxy)ethyl acetate*               1.76
11   TDCPP*                                         1.71
12   Zearalanol                                     1.67
13   PFOA*                                          1.66
14   Butyl paraben                                  1.66
15   Noristerat                                     1.65
16   p-Synephrine                                   1.55
17   Alprostadil                                    1.55
18   Sclareol                                       1.55
19   PFDA*                                          1.51
20   Simvastatin                                    1.50
* Confirmed with standard
Top 100 Priority Compounds
-------
Exposure Forensics for Recycled Products
Ubiquitous chemicals in articles (e.g., phthalates)
Fragrances in recycled paper products
Chemicals (variety of functions) that only occur in recycled tire products
Fragrances in toys
[Figure by C. Lowe and K. Isaacs: Tentatively Identified and Confirmed Chemicals across products. Product Category legend: Construction; Fabric and Home Goods; Food Contact; Paper; Plastic Home & Auto; Recycled Tire Products; Toys & Play Mats]
[Figure: Mean Number of Chemicals per Sample (0 to 15) for Toys/Play Mats, Recycled Tire Products, Household Products, Paper Products, Food Contact Materials, Fabric Home Goods, and Construction Materials. Classification: Recycled, Virgin]
-------
Altered Cell Signaling
The Placental Exposome
(via LC-HRMS)
Different Environmental
Exposures
Biomarker Discovery for Placenta Samples
29 in
Preeclampsia
~
Impaired Angiogenesis
*
-5 0 5
Log2 Fold Change
controls
508 n cases
10
Collaboration with J. Rager (UNC Chapel Hill) and J. Grossman (Agilent)
-------
NTA Best Practices
Tracers
  Example: Isotopically labeled standards: 13C3-Atrazine, D3-Thiamethoxam, 13C4,15N2-Fipronil
  Purpose: Allows tracking of chromatographic performance and mass accuracy
Replication
  Example: Triplicate injections of same sample vial
  Purpose: Removes risk of "one hit wonder"
Run order randomization
  Example: 8, 3, 7, 4, 2, 1, 10, 5, 8, 6, 9, 2, 5, 4, 1, 9, 4, 7, 3, 8, 1, 6, 10, 9, 6, 7, 5, 3, 2, 10
  Purpose: Minimizes/averages out batch or sample order effects (e.g., carryover, temp & instrument drift)
Pooled QC sample
  Example: Combine 5 mg/µL from each of 10 samples (total 50 mg/µL) prior to extract to create pooled QC
  Purpose: Separate confirmation of presence with different matrix, MS2 IDs
Blanks
  Example: Solvent, method, matrix, double blanks
  Purpose: Allows identification/subtraction/deletion of interferences introduced in lab processes
Multiple lines of evidence for ID
  Example: RT prediction/matching, spectra prediction/matching, data source ranking, functional/product uses, media occurrence
  Purpose: Improves confidence in identification when chemical standards are unavailable
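The run order randomization practice above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `randomized_run_order`, the seed, and the plain (unblocked) shuffle are hypothetical choices, not the team's documented procedure.

```python
import random
from collections import Counter


def randomized_run_order(n_samples, n_replicates, seed=None):
    """Return a shuffled injection order in which every sample ID
    (1..n_samples) appears exactly n_replicates times."""
    rng = random.Random(seed)  # a fixed seed keeps the sequence reproducible
    order = [sample for sample in range(1, n_samples + 1)
             for _ in range(n_replicates)]
    rng.shuffle(order)
    return order


# 10 samples injected in triplicate, as in the slide's example sequence.
run_order = randomized_run_order(n_samples=10, n_replicates=3, seed=42)
injections_per_sample = Counter(run_order)
print(run_order)
```

A simple shuffle does not prevent the same vial from being injected twice in a row; a blocked design would be needed if that matters for carryover.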
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
DB & Library Matching
Data Analysis
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
  Sample Extracts
  LC/Q-TOF HRMS
  MS1 Acquisition
  MS2 Acquisition
  MS2 .d Files
  MS2 .mgf Files
DB & Library Matching
  Chemical Database
  DB MS-Ready Structures
  DB MS-Ready Formula & Monoisotopic Mass
  Reference MS2 Spectra
  in silico MS2 Spectra
Data Analysis
  Filtered Feature Table
  Chemical Candidate Table
  Aggregated Match Table
  MS2 Reference Matches
  MS2 in silico Matches
-------
Experimental Acquisition
1,269 Substances in 10 Mixtures
Agilent 6530B Q-TOF
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
DB & Library Matching
Data Analysis
-------
Chemical Database = DSSTox
[Figure: DSSTox curation workflow. Labels: EPA-relevant lists; EPA ACToR; PubChem; ChemID; EPA SRS; Autoloads: no conflicts allowed; 1:1:1 CASRN-name-structure; >1M (conflicts detected, not loaded); Curation Queue; Environmental/Toxicity/Exposure Relevance; DSSTox_V1; Manually curated; DSSTox qc_levels: Public_Untrusted, Public_Low, Public_Med, Public_High, DSSTox_Low, DSSTox_High. Record counts shown on the figure: ~742K, >300K, ~160K, ~56K, ~582K, ~104K, ~24K, ~32K, ~19K, ~5K]
Computational Toxicology 12 (2019) 100096
EPA's DSSTox database: History of development of a curated chemistry resource supporting computational toxicology research
Christopher M. Grulke, Antony J. Williams, Inthirany Thillanadarajah, Ann M. Richard*
a National Center for Computational Toxicology, Office of Research & Development, US Environmental Protection Agency, Mail Drop D143-02, Research Triangle Park, NC 27711, USA
b Senior Environmental Employment Program, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
-------
MS-Ready Structures
Spiked Substance: Tamoxifen (DTXSID1034187)
Spiked Substance: Tamoxifen citrate (DTXSID8021301)
Predicted Formula for Observed Molecular Feature: C26H29NO
Dashboard Search
1st: DTXSID1034187
2nd: DTXSID8021301
MS-Ready Processing: DTXCID9014187 (the same MS-ready form for both substances)
-------
Dashboard Access
comptox.epa.gov/dashboard
875 Thousand Chemicals
Chemicals
Product/Use Categories Assay/Gene
Search for chemical by systematic name, synonym, CAS number, DTXSID or InChIKey
DSSTox MS Ready Mapping File Posted:
be used by mass spectrometrists for the purpose of structure identification. A normal formula search would search the exact formula associated with any chemical, whether it include solvents of hydration, salt
multiple components. However, mass spectrometry detects ionized chemical structures and molecular formulae searches should be based on desalted, and desolvated structures with stereochemistry removed. We refer to these as "MS ready structures"
and the MS-ready mappings are delivered as Excel Spreadsheets containing the Preferred Name, CAS-RN, DTXSID, Formula, Formula of the MS-ready structure and associated masses, SMILES and InChI Strings/Keys. (UPDATED APRIL 2019)
McEachran et al. J Cheminform (2018) 10:45
https://doi.org/10.1186/s13321-018-0299-2
METHODOLOGY
Journal of Cheminformatics
Open Access
"MS-Ready" structures for non-targeted high-resolution mass spectrometry screening studies
Andrew D. McEachran, Kamel Mansouri, Chris Grulke, Emma L. Schymanski, Christoph Ruttkies and Antony J. Williams*
Nicarbazin
DTXSID6034762
C19H18N6O6
C13H10N4O5/C6H8N2O
MS-Ready Forms
DTXCID8023761
DTXCID50209864
C6H8N2O
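The formula bookkeeping behind the Nicarbazin example can be checked with a short script. This is a minimal sketch, not the Dashboard's MS-ready processing: it only splits a multi-component formula at "/" and computes monoisotopic masses. The helper names and the small isotope-mass table are assumptions made for illustration.

```python
import re
from collections import Counter

# Monoisotopic masses (Da) of the most abundant isotopes; only the
# elements needed for the examples in this deck are listed.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
    "Cl": 34.96885268,
}


def parse_formula(formula):
    """Turn 'C6H8N2O' into Counter({'H': 8, 'C': 6, 'N': 2, 'O': 1})."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def monoisotopic_mass(formula):
    """Sum isotope masses for a single-component formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


def split_components(mixture_formula):
    """Split a multi-component formula such as 'A/B' into its parts."""
    return mixture_formula.split("/")


components = split_components("C13H10N4O5/C6H8N2O")
combined = parse_formula(components[0]) + parse_formula(components[1])
tamoxifen_mass = monoisotopic_mass("C26H29NO")
print(components, round(tamoxifen_mass, 4))
```

The two component formulas add up to the C19H18N6O6 listed for Nicarbazin, which is why a search on the combined formula would miss what the instrument actually detects.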
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
DB & Library Matching
Data Analysis
-------
EPA NTA WebApp
Feature Removal:
1) Duplicate features
2) Non-reproducible features
3) Blank features (sample:blank)
4) Non-responsive features (dilutions)
Feature Flagging:
1) Multi-mode hits (+ and -)
2) Meas. precision (CV threshold)
3) Formula match (score > threshold)
4) Negative mass defect
5) Halogenation
6) Has/is adduct
7) Has/is neutral loss
8) Has/is multimer
Dashboard Integration:
1) Data source & pub counts
2) Bioactivity & exposure levels
3) Presence on lists
4) Product & use categories
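A few of the removal and flagging rules listed above can be sketched as follows. This is a hedged illustration, not the WebApp's code: the feature dictionary layout, the function name `process_feature`, and the thresholds (sample:blank ratio of 3, CV of 0.3) are all assumptions. The nominal mass is approximated by rounding, which is a simplification.

```python
from statistics import mean, stdev


def process_feature(feature, min_sample_blank_ratio=3.0, max_cv=0.3):
    """Return None if the feature is removed, otherwise a dict of flags."""
    samples = feature["sample_abundances"]
    blanks = feature["blank_abundances"]

    # Non-reproducible feature: missing in at least one replicate.
    if any(abundance <= 0 for abundance in samples):
        return None

    # Blank feature: sample signal is not clearly above the blank signal.
    blank_mean = mean(blanks)
    if blank_mean > 0 and mean(samples) / blank_mean < min_sample_blank_ratio:
        return None

    # Flags: measurement precision (CV) and negative mass defect.
    cv = stdev(samples) / mean(samples)
    mass_defect = feature["mass"] - round(feature["mass"])
    return {"high_cv": cv > max_cv, "negative_mass_defect": mass_defect < 0}


# Hypothetical features, invented for this example.
features = [
    {"id": "F1", "mass": 498.9302,
     "sample_abundances": [9800, 10100, 10300], "blank_abundances": [120, 90, 150]},
    {"id": "F2", "mass": 371.2249,
     "sample_abundances": [5000, 0, 5200], "blank_abundances": [0, 0, 0]},
    {"id": "F3", "mass": 149.0233,
     "sample_abundances": [2000, 2100, 1900], "blank_abundances": [1800, 2000, 1900]},
]
results = {f["id"]: process_feature(f) for f in features}
print(results)
```

Here F1 survives and is flagged for a negative mass defect, F2 is dropped as non-reproducible, and F3 is dropped because it barely exceeds the blank.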
[Screenshot: "NTA: non-targeted analysis of MS data (beta)", Run NTA form served at http://127.0.0.1:8000/nta/. Input fields: Positive MPP file (CSV); Negative MPP file (CSV); ENTACT files?; Adduct mass accuracy units; Adduct mass accuracy; Adduct retention time accuracy (min); Tracer file (CSV; optional); Tracer mass accuracy units; Tracer retention time accuracy (min); Min sample:blank cutoff; Parent ion mass accuracy; Search dashboard by; Save top result only?; Save Metadata?]
-------
Agilent LC/Q-TOF Simplified Workflow
Experimental Acquisition
  Sample Extracts
  LC/Q-TOF HRMS
  MS1 Acquisition
  MS2 Acquisition
  MS2 .d Files
  MS2 .mgf Files
DB & Library Matching
  Chemical Database
  DB MS-Ready Structures
  DB MS-Ready Formula & Monoisotopic Mass
  Reference MS2 Spectra
  in silico MS2 Spectra
Data Analysis
  Filtered Feature Table
  Chemical Candidate Table
  Aggregated Match Table
  MS2 Reference Matches
  MS2 in silico Matches
-------
Generation of in silico Spectra
CFM-ID v2.0
Competitive fragmentation modeling of ESI-MS/MS
spectra for putative metabolite identification
Felicity Allen, Russ Greiner, David Wishart
Linking in silico MS/MS spectra with
chemistry data to improve identification
of unknowns
Andrew D. McEachran, Ilya Balabin, Tommy Cathey, Thomas R. Transue, Hussein Al-Ghoul, Chris Grulke, Jon R. Sobus & Antony J. Williams
Machine Learning
Fragmentation
Prediction
Model
Training Set:
Metlin MS2 spectra
and structures
DSSTox MS-Ready
Structures
(~765,000)
DSSTox MS2
spectra
(10, 20, 40 V)
McEachran, Andrew D., et al. Scientific Data 6.1 (2019): 1-9.
Allen, Felicity, et al. Metabolomics 11.1 (2015): 98-110.
-------
CFM-ID Database Matching
MGF file
1. Query database by mass
Retrieve candidate compounds within mass window
Candidate 1: C19H20N2O3S (Mass = 356.119), in silico MS2 Spectra (CE 10, 20, 40)
Candidate 2: C19H20N2O3S (Mass = 356.119), in silico MS2 Spectra (CE 10, 20, 40)
Candidate 3: C21H21ClO3 (Mass = 356.118), in silico MS2 Spectra (CE 10, 20, 40)
2. Score in silico spectra
CFM-ID Scores
              in silico CE 10   in silico CE 20   in silico CE 40
Candidate 1   0.5               0.3               0.1
Candidate 2   0.2               0.1               0.02
Candidate 3   0.1               0.05              0.01
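The "query database by mass" step, and the optional formula filter used when a formula has been assigned, can be sketched as below. The record layout, the function names, and the 5 ppm tolerance are assumptions for illustration; the three 356.1 Da candidates and the tamoxifen formula come from the slides, with the tamoxifen mass rounded here to three decimals.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def candidates_in_window(observed_mass, database, tolerance_ppm=5.0):
    """Keep database records whose mass lies within +/- tolerance_ppm."""
    return [record for record in database
            if abs(ppm_error(observed_mass, record["mass"])) <= tolerance_ppm]


def filter_by_formula(candidates, formula):
    """Keep only candidates matching the formula assigned to the feature."""
    return [record for record in candidates if record["formula"] == formula]


# Toy database: three candidates from the slide plus one distant compound.
database = [
    {"name": "Candidate 1", "formula": "C19H20N2O3S", "mass": 356.119},
    {"name": "Candidate 2", "formula": "C19H20N2O3S", "mass": 356.119},
    {"name": "Candidate 3", "formula": "C21H21ClO3", "mass": 356.118},
    {"name": "Tamoxifen", "formula": "C26H29NO", "mass": 371.225},
]

by_mass = candidates_in_window(356.119, database)
by_formula = filter_by_formula(by_mass, "C19H20N2O3S")
print([r["name"] for r in by_mass], [r["name"] for r in by_formula])
```

Searching by mass alone returns all three isobaric candidates; adding the formula drops Candidate 3 before any spectra are scored.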
-------
CFM-ID Database Matching (w/ Formula Information)
MGF file: Exp MS2 Spectrum (Mass = 356.119), Formula Identified
1. Query database by mass
Retrieve candidate compounds within mass window
Filter candidates by formula
Candidate 1: C19H20N2O3S (Mass = 356.119), in silico MS2 Spectra (CE 10, 20, 40)
Candidate 2: C19H20N2O3S (Mass = 356.119), in silico MS2 Spectra (CE 10, 20, 40)
Candidate 3: C21H21ClO3 (Mass = 356.118), in silico MS2 Spectra (CE 10, 20, 40), removed by the formula filter
2. Score in silico spectra
CFM-ID Scores
              in silico CE 10   in silico CE 20   in silico CE 40
Candidate 1   0.5               0.3               0.1
Candidate 2   0.2               0.1               0.02
-------
CFM-ID Database Matching (w/ Multiple CE_experimental)
1. Query database by mass
Exp MS2 CE10 Spectrum (Mass = 356.119)
Exp MS2 CE20 Spectrum (Mass = 356.119)
Exp MS2 CE40 Spectrum (Mass = 356.119)
Retrieve candidate compounds within mass window
Candidate 1: C19H20N2O3S (Mass = 356.119), in silico MS2 Spectra (CE 10, 20, 40)
Candidate 2: C19H20N2O3S (Mass = 356.119), in silico MS2 Spectra (CE 10, 20, 40)
Candidate 3: C21H21ClO3 (Mass = 356.118), in silico MS2 Spectra (CE 10, 20, 40)
2. Score in silico spectra
CFM-ID Scores
                          in silico CE 10   in silico CE 20   in silico CE 40
Candidate 1, CEexp = 10   0.5               0.3               0.1
Candidate 1, CEexp = 20   0.4               0.5               0.12
Candidate 1, CEexp = 40   0.05              0.1               0.2
-------
CFM-ID Scoring Approaches
[Figure: scoring approaches for Precursor 1 against Candidate 1. Each experimental spectrum is compared with each predicted spectrum, giving up to nine scores.]
                                 Predicted CE 10   Predicted CE 20   Predicted CE 40
Experimental Spectrum at CE 10   A                 B                 C
Experimental Spectrum at CE 20   D                 E                 F
Experimental Spectrum at CE 40   G                 H                 I
Approaches shown:
One experimental spectrum, one predicted spectrum: Score = A
One experimental spectrum, three predicted spectra: Score = A+B+C
Three experimental spectra scored separately: Score = A+B+C, Score = D+E+F, Score = G+H+I
Three experimental spectra combined: Score = A+B+C+D+E+F+G+H+I
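The summed scoring approaches can be illustrated with the Candidate 1 scores shown on the multiple-CE slide. This is a sketch of the arithmetic only; the nested-dictionary layout and function names are hypothetical, and treating the first approach as the matched-energy score is an assumption.

```python
# scores[experimental CE][predicted CE] for Candidate 1, from the slide.
scores = {
    10: {10: 0.5, 20: 0.3, 40: 0.1},
    20: {10: 0.4, 20: 0.5, 40: 0.12},
    40: {10: 0.05, 20: 0.1, 40: 0.2},
}


def matched_energy_score(scores, ce):
    """Score of one experimental spectrum against the predicted
    spectrum at the same collision energy (e.g., A)."""
    return scores[ce][ce]


def summed_score(scores, experimental_ce):
    """Sum over all predicted energies for one experimental spectrum
    (e.g., A+B+C)."""
    return sum(scores[experimental_ce].values())


def total_score(scores):
    """Sum over every experimental/predicted pair (A through I)."""
    return sum(summed_score(scores, ce) for ce in scores)


per_energy = {ce: round(summed_score(scores, ce), 2) for ce in scores}
overall = round(total_score(scores), 2)
print(per_energy, overall)
```

For these values the per-energy sums are 0.9, 1.02, and 0.35, and the combined score is 2.27.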
-------
EPA's Non-Targeted Analysis Collaborative Trial
The Trial Mixtures:
10 Mixtures ranging from 95 to 365 compounds
(Total: 1,269 unique compounds)
"Pass" compounds = 377 with MS2 data
Agilent 1290 UPLC
Agilent 6530B Q-TOF with ESI source
Ulrich, Elin M., et al. Analytical and Bioanalytical Chemistry 411.4 (2019): 853-866.
Sobus, Jon R., et al. Analytical and Bioanalytical Chemistry 411.4 (2019): 835-851.
-------
Reference vs. in silico Library Coverage
[Venn diagram of "Pass" compounds for PCDL and CFM-ID; region counts shown: 88, 111, 77, 101]
MS2 Library                    % of "Pass" Compounds Identified
Agilent PCDL                   53%
CFM-ID Top Hit                 50%
PCDL and/or CFM-ID Top Hit     73%
PCDL -> Agilent reference MS2 library
"Pass" compounds (n=377) -> ENTACT chemicals observed with MS2 data
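The reported percentages can be reproduced from the four counts on the Venn diagram. The assignment of counts to regions below (88 PCDL only, 111 both, 77 CFM-ID only, 101 neither) is an assumption inferred from the percentages, since the diagram itself did not survive extraction.

```python
# Assumed region assignment for the Venn diagram counts.
pcdl_only, both, cfmid_only, neither = 88, 111, 77, 101
total = pcdl_only + both + cfmid_only + neither  # all "Pass" compounds


def percent(count):
    """Whole-number percentage of the 'Pass' compounds."""
    return round(100 * count / total)


coverage = {
    "Agilent PCDL": percent(pcdl_only + both),
    "CFM-ID Top Hit": percent(both + cfmid_only),
    "PCDL and/or CFM-ID Top Hit": percent(pcdl_only + both + cfmid_only),
}
print(total, coverage)
```

Under this reading, 101 of the 377 "Pass" compounds were identified by neither library.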
-------
NTA Workflows: Using CFM-ID Results as Filters
Score
Filter out candidates
below score cutoff
Variability in score
distribution
Rank
Filter out candidates
above rank cutoff
Variability in number of
candidate compounds
Filter by Top 20
-------
Normalizing CFM-ID Results Values
Score Quotient: Normalize score to the highest candidate compound score
Score Percentile: Normalize rank to the number of candidate compounds
                       Rank   CFM-ID Score   Maximum Score   Score Quotient   Score Percentile
Candidate Compound 1   1      0.5            0.5             1                100
Candidate Compound 2   2      0.4            0.5             0.8              80
Candidate Compound 3   3      0.39           0.5             0.78             60
Candidate Compound 4   4      0.1            0.5             0.2              40
Candidate Compound 5   5      0.05           0.5             0.1              20
Score Quotient = Score / Maximum Score
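Both normalizations can be reproduced from the five example scores. The quotient follows the formula on the slide. The percentile formula used here, 100 × (N − rank + 1) / N, is an assumption inferred from the example values (100, 80, 60, 40, 20), and ties are not handled.

```python
def score_quotients(scores):
    """Divide each score by the highest score among the candidates."""
    top = max(scores)
    return [score / top for score in scores]


def score_percentiles(scores):
    """Convert ranks (1 = best) to percentiles relative to the number
    of candidates: 100 * (N - rank + 1) / N."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    percentiles = [0.0] * n
    for rank, index in enumerate(order, start=1):
        percentiles[index] = 100 * (n - rank + 1) / n
    return percentiles


cfmid_scores = [0.5, 0.4, 0.39, 0.1, 0.05]
quotients = [round(q, 2) for q in score_quotients(cfmid_scores)]
percentiles = score_percentiles(cfmid_scores)
print(quotients, percentiles)
```

The quotient keeps the spacing between scores (0.8 and 0.78 stay close together), while the percentile depends only on rank and on how many candidates were retrieved.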
-------
NTA Workflows: Using CFM-ID Normalized Results as Filters
Score Quotient
Filter out candidates
below score quotient
cutoff
Score Percentile
Filter out candidates
below percentile cutoff
Score quotient cutoff = 0.5
Keep candidates scoring at least half of max score
MS2 Spectrum 1
Candidate Scores
Score percentile cutoff = 0.5
Keep the top 50% of candidates
MS2 Spectrum 1
Candidate Scores
MS2 Spectrum 2
Candidate Scores
MS2 Spectrum 2
Candidate Scores
-------
Applying Cut-off Filters to Data
                       CFM-ID Score   Maximum Score   Score Quotient
Candidate Compound 1   0.5            0.5             1
Candidate Compound 2   0.4            0.5             0.8
Candidate Compound 3   0.39           0.5             0.78
Candidate Compound 4   0.1            0.5             0.2
Candidate Compound 5   0.05           0.5             0.1
[Figure: candidates plotted by Score Quotient, from 0 upward]
-------
Applying Cut-off Filters to Data
CFM-ID Score   Maximum Score   Score Quotient
0.5            0.5             1
0.4            0.5             0.8
0.39           0.5             0.78
0.1            0.5             0.2
0.05           0.5             0.1
Legend: True Compound; Other Candidate Compounds
Outcomes: True Positives; False Negatives; True Negatives; False Positives
-------
Applying Cut-off Filters to Data
                       CFM-ID Score   Maximum Score   Score Quotient
Candidate Compound 1   0.5            0.5             1
Candidate Compound 2   0.4            0.5             0.8
Candidate Compound 3   0.39           0.5             0.78
Candidate Compound 4   0.1            0.5             0.2
Candidate Compound 5   0.05           0.5             0.1
Legend: True Compound; Other Candidate Compounds
Score Quotient Cut-off = 0
True Positives: 1
False Negatives: 0
True Negatives: 0
False Positives: 4
-------
Applying Cut-off Filters to Data
                       CFM-ID Score   Maximum Score   Score Quotient
Candidate Compound 1   0.5            0.5             1
Candidate Compound 2   0.4            0.5             0.8
Candidate Compound 3   0.39           0.5             0.78
Candidate Compound 4   0.1            0.5             0.2
Candidate Compound 5   0.05           0.5             0.1
Legend: True Compound; Other Candidate Compounds
Score Quotient Cut-off = 0.5
True Positives: 1
False Negatives: 0
True Negatives: 2
False Positives: 2
-------
Applying Cut-off Filters to Data
                       CFM-ID Score   Maximum Score   Score Quotient
Candidate Compound 1   0.5            0.5             1
Candidate Compound 2   0.4            0.5             0.8
Candidate Compound 3   0.39           0.5             0.78
Candidate Compound 4   0.1            0.5             0.2
Candidate Compound 5   0.05           0.5             0.1
Legend: True Compound; Other Candidate Compounds
Score Quotient Cut-off = 0.9
True Positives: 0
False Negatives: 1
True Negatives: 3
False Positives: 1
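The counts on the three cut-off slides can be reproduced with a short function. The text does not say which candidate is the true compound; the counts are consistent with it being Candidate Compound 2 or 3, and the sketch below assumes Candidate Compound 2. The function name and tuple layout are illustrative.

```python
def confusion_counts(quotients, true_index, cutoff):
    """Count outcomes when candidates with quotient >= cutoff are kept.
    Returns (true_pos, false_neg, true_neg, false_pos)."""
    tp = fn = tn = fp = 0
    for index, quotient in enumerate(quotients):
        kept = quotient >= cutoff
        if index == true_index:
            if kept:
                tp += 1   # true compound retained
            else:
                fn += 1   # true compound filtered out
        elif kept:
            fp += 1       # wrong candidate retained
        else:
            tn += 1       # wrong candidate filtered out
    return tp, fn, tn, fp


quotients = [1.0, 0.8, 0.78, 0.2, 0.1]
TRUE_INDEX = 1  # assumption: Candidate Compound 2 is the true compound

counts_by_cutoff = {cutoff: confusion_counts(quotients, TRUE_INDEX, cutoff)
                    for cutoff in (0.0, 0.5, 0.9)}
print(counts_by_cutoff)
```

At a cut-off of 0.9 only the top-scoring candidate survives, and in this example it is not the true compound, which is why an aggressive cut-off costs a true positive.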
-------
Balancing Cut-offs
True Positive Rate (TPR) = TP / (TP + FN)
How many of the true compounds are we keeping?
False Positive Rate (FPR) = FP / (FP + TN)
How much of the junk are we getting rid of?
-------
Quotient Vs. Percentile Cutoffs
[Figure: Global ROC Curves (All ENTACT Mixtures). y-axis: True Positive Rate (0.0 to 1.0); x-axis: False Positive Rate (0.0 to 1.0). Series: Quotient (by formula); Percentile (by formula); Quotient (by mass); Percentile (by mass)]
-------
Quotient Vs. Percentile Cutoffs
[Figure: Global ROC Curves (All ENTACT Mixtures). y-axis: True Positive Rate (0.0 to 1.0); x-axis: False Positive Rate (0.0 to 1.0). Series: Quotient (by formula); Percentile (by formula); Quotient (by mass); Percentile (by mass). Inset: Score Quotient Cut-off applied to True Compounds and Other Candidate Compounds]
-------
Quotient Vs. Percentile Cutoffs
[Figure: Global ROC Curves (All ENTACT Mixtures). y-axis: True Positive Rate (0.0 to 1.0); x-axis: False Positive Rate (0.0 to 1.0). Series: Quotient (by formula); Percentile (by formula); Quotient (by mass); Percentile (by mass)]
Cut-off Values for Global TPR = 0.9
                          Cut-off value
Quotient (by formula)     0.18
Percentile (by formula)   38
Quotient (by mass)        0.13
Percentile (by mass)      32
Apply to individual ENTACT mixtures
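Choosing a cut-off for a target global TPR amounts to sweeping candidate cut-offs from strict to lenient and stopping at the first one that keeps enough true compounds. The sketch below shows that idea on invented data; the records, the function names, and the choice to test only observed quotient values are assumptions, and the numbers are unrelated to the ENTACT results.

```python
def rates(records, cutoff):
    """Return (TPR, FPR) when records with quotient >= cutoff are kept.
    Each record is (score_quotient, is_true_compound)."""
    tp = sum(1 for q, is_true in records if is_true and q >= cutoff)
    fn = sum(1 for q, is_true in records if is_true and q < cutoff)
    fp = sum(1 for q, is_true in records if not is_true and q >= cutoff)
    tn = sum(1 for q, is_true in records if not is_true and q < cutoff)
    return tp / (tp + fn), fp / (fp + tn)


def cutoff_for_tpr(records, target_tpr=0.9):
    """Largest observed quotient that still reaches the target TPR."""
    for cutoff in sorted({q for q, _ in records}, reverse=True):
        tpr, fpr = rates(records, cutoff)
        if tpr >= target_tpr:
            return cutoff, tpr, fpr
    raise ValueError("no cut-off reaches the target TPR")


# Hypothetical pooled results: 10 true compounds and 8 other candidates.
true_quotients = [1.0, 1.0, 1.0, 0.9, 0.8, 0.7, 0.5, 0.4, 0.2, 0.05]
other_quotients = [1.0, 0.6, 0.3, 0.1, 0.1, 0.05, 0.02, 0.01]
records = ([(q, True) for q in true_quotients]
           + [(q, False) for q in other_quotients])

chosen_cutoff, tpr, fpr = cutoff_for_tpr(records, target_tpr=0.9)
print(chosen_cutoff, tpr, fpr)
```

In this toy set a cut-off of 0.2 keeps 9 of 10 true compounds while removing 5 of the 8 other candidates.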
-------
CFM-ID Cut-off Filtering: Individual ENTACT Mixtures
[Figure: results of cut-off filtering for each ENTACT mixture; axis scaled 0.0 to 1.0]
-------
Experimental Acquisition
  Sample Extracts
  LC-QTOF/MS
  MS1 Acquisition
  MS2 Acquisition
  MS2 Acquisition .d Files
  MS2 Exported .mgf Files
Database/Library Matching
  DSSTox Database1 (~875,000 Substances)
  DSSTox MS-Ready Structures2
  DSSTox MS-Ready Formulae
  Agilent PCDL.csv (DSSTox MS-Ready Formulae)
  Agilent PCDL (Reference MS2 Spectra)
  CFM-ID Database4 (in silico MS2 Spectra)
Data Analysis
  MS1 Feature Table
  Filtered Feature Table
  Chemical Candidate Table (Dashboard MetaData)3
  Aggregated Match Table
  PCDL Results Table (Manually Reviewed)
  CFM-ID Results Table (Percentiles & Quotients)
Sobus et al. https://link.springer.com/article/10.1007%2Fs00216-018-1526-4
Newton et al. https://www.sciencedirect.com/science/article/pii/S026974911732691X?via%3Dihub
Hedgespeth et al. https://www.sciencedirect.com/science/article/pii/S004896971933298X?via%3Dihub
1Grulke et al. https://www.sciencedirect.com/science/article/pii/S2468111319300234
2McEachran et al. https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2
3Williams et al. https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0247-6
4McEachran et al. https://www.nature.com/articles/s41597-019-0145-z
-------
Take-away Messages
EPA/ORD NTA activities:
• Focused on applications
  - qualitative (to date) -> semi-quantitative (soon)
  - must support HT exposure prediction & risk evaluation
• R&D required to support applications
  - Experimental + cheminformatic + computational efforts = Viable NTA program
• Growing capacity with new instrumentation
  - Requires flexible workflows
    ° Work smarter, not harder
    ° Don't reinvent the wheel
    ° Build once, use many (A. Williams)
-------
Contributing Researchers
This work was
supported, in
part, by ORD's
Pathfinder
Innovation
Program (PIP)
and an ORD
EMVL award
EPA ORD
Hussein Al-Ghoul*
Alex Chao*
Jarod Grossman*
Kristin Isaacs
Sarah Laughlin*
Charles Lowe
James McCord
Jeff Minucci
Seth Newton
Katherine Phillips
Tom Purucker
Randolph Singh*
Mark Strynar
Elin Ulrich
* = ORISE/ORAU
EPA ORD (cont.)
Chris Grulke
Kamel Mansouri*
Andrew McEachran*
Ann Richard
John Wambaugh
Antony Williams
Agilent
Jarod Grossman
Andrew McEachran
GDIT
Ilya Balabin
Tom Transue
Tommy Cathey
-------
Questions?
------- |