EPA United States Environmental Protection Agency
Predicting Chromatography-tandem Mass Spectrometry Amenability to Improve Non-targeted Analysis
Charles Lowe1, Kristin Isaacs1, Chris Grulke1, Jon Sobus1, Elin Ulrich1, Alex Chao1,2, and Antony J. Williams1
1. Center for Computational Toxicology and Exposure, U.S. Environmental Protection Agency, Research Triangle Park, NC
2. Oak Ridge Institute for Science and Education (ORISE) Research Participant, Research Triangle Park, NC
Office of Research and Development
Center for Computational Toxicology and Exposure
-------
Disclaimer: The views expressed in this presentation are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. This presentation has not been reviewed for policy and is not for distribution.
-------
What are we trying to model?
[Figure: example mass spectrum; x-axis m/z, 20 to 200. Image source: https://images.app.goo.gl/ftRmhwxEtZs95uKv7]
For more details on NTA at EPA, please see: "EPA's research initiatives on non-targeted analyses of environmental chemicals" (PRESENTER: Jon Sobus; PAPER ID: 3428870)
-------
The more data, the better (most of the time...)
MoNA - MassBank of North America
Downloads
A set of commonly referenced predefined queries. Clicking the name of the query will display the associated spectra in the query browser.
Each query is also available to download in either the MoNA internal JSON format or as NIST MS Search compatible MSP files.
All Spectra (659,728 spectra)
In-Silico Spectra (490,087 spectra)
Experimental Spectra (169,641 spectra)
GC-MS Spectra (18,883 spectra)
LC-MS Spectra (133,301 spectra)
LC-MS/MS Spectra (125,833 spectra)
LC-MS/MS Positive Mode (86,576 spectra)
LC-MS/MS Negative Mode (38,475 spectra)
772 compounds in derivatized GCMS
7,199 compounds in non-derivatized GCMS
3,549 compounds in ESI+ LCMS
2,630 compounds in ESI- LCMS
-------
Caffeine
[Screenshots: two MoNA spectrum records for caffeine]
Record 1: originally submitted to the MassBank High Quality Mass Spectral Database
instrument type: QqQ
instrument: Micromass Quattromicro
collision energy: 15eV
ionization: ESI
ionization mode: positive
ms level: MS2
precursor m/z: 194.9000
precursor type: [M-H]+
accession: PM018511
publication: Alonso-Salces KM, Guillou
Record 2: originally submitted to the RIKEN MSn Spectral Database for Phytochemicals
instrument: Pegasus EI TOF-MS system
instrument type: GC-EI-TOF
ms level: MS1
retention index: 1880.2430
retention time: 724.344 sec
ionization mode: positive
accession: OUF00133
date: 20??.01.19 (Created 2010....
author: Tsujimoto Y, Tsugawa H, ..
license: CC BY-SA
-------
Describing structures for modeling
Software News and Update
PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints
CHUN WEI YAP
Department of Pharmacy, Pharmaceutical Data Exploration Laboratory, National University of Singapore,
Singapore
Received 17 May 2010; Revised 22 August 2010; Accepted 12 October 2010
DOI 10.1002/jcc.21707
Published online 17 December 2010 in Wiley Online Library (wileyonlinelibrary.com).
1,444 1D & 2D molecular descriptors from QSAR-ready SMILES. Examples include:
-Electrotopological state
-McGowan volume (van der Waals volume)
-Molecular linear free energy relationships
-Atom, bond, & ring counts
-LogP predictions, etc.
-------
Reduction of descriptor space
Dimension reduction will improve our models and make calculations quicker
1. Remove any constant descriptors (variance(x) = 0)
2. Remove nearly constant descriptors (SD < 0.25)
 - 0.25 gives a good balance between reduction and retention
3. Calculate pair-wise correlations between remaining descriptors
 - Eliminate based on a cutoff = 0.96 correlation
1,444 descriptors -> 385 descriptors
-------
Datasets suitable for modeling
Models need both training and test data
75% of data for training, 25% for testing
-Data stratified to maintain proportions in outcome variable
-Different for each model
-InChIKey skeleton as identifier
External validation datasets
-EPA's NTA Collaborative Trial (ENTACT) data (explicitly removed from train/test sets)
R libraries used in study:
library(readxl)
library(caret)
library(randomForest)
library(funModeling)
library(tidyverse)
library(GA)
library(AdaSampling)
library(wsrf)
library(rsample)
library(dbscan)
-------
Learning approach
Four models
-GC (derivatized), GC (not derivatized), ESI+ LC, ESI- LC
Random forest (will explain)
-Downsample absence data to match count of presence data
-Optimize mtry and ntree via grid search
-5-fold cross validation
-Y-randomization
[Figure: "Random Forest Simplified". An instance is passed to Tree-1, Tree-2, and so on; each tree votes Class-A or Class-B; majority voting gives the final class.]
-------
Choosing the correct descriptors to predict the endpoint
Random Forest Algorithm
Training set X = x1, x2, ..., xn with responses Y = y1, y2, ..., yn
For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these Xb, Yb.
2. Train a classification tree fb on Xb, Yb.
3. The majority vote of all fb classifies unseen samples.
[Figure: example decision tree with nodes such as "Creamer?" and "Artwork?"]
-------
Classification models need negative data, in addition to positive data
-labs do not report chemicals NOT seen, only those identified by the instrument
How do we provide a model with negative data?
-produce the negative data ourselves (but note it is expensive)
-assume all chemicals not present are absent
-make assumption(s) as to what WAS tested
For now, let's assume that if a chemical is detected in either ESI+ or ESI-, then it has also been tested in the other mode
Still exploring reasonable assumptions for GCMS
-------
Descriptor importance
[Figure: descriptor importance plots; the first extracted label is MDEO.11]
Important descriptor descriptions
MDEO-11 -molecular distance edge between all primary oxygens
MLFER-A -overall or summation solute hydrogen bond acidity
SHsOH & maxHsOH -electrotopological state with respect to -OH fragments
nN -the number of N atoms...
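Descriptor rankings of this kind are often obtained by permutation importance: shuffle one descriptor column at a time and record how much the model's accuracy drops. The slides do not state which importance measure was used, so the following is only a Python sketch of the general idea, not the authors' R workflow. The toy descriptor table, the labels, and the rule-based `toy_predict` stand-in for a trained forest are all hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = accuracy(y, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical descriptor table: columns are [nN, nsOH, noise].
X = [
    [0, 1, 5], [2, 0, 3], [0, 0, 7], [1, 2, 1],
    [3, 1, 4], [0, 2, 2], [1, 0, 6], [0, 1, 8],
]
# Hypothetical labels: 1 = present, 0 = absent.
y = [0, 1, 0, 1, 1, 0, 1, 0]


def toy_predict(row):
    """Stand-in for a trained model: 'present' if the molecule has any N atom."""
    return 1 if row[0] >= 1 else 0


importances = permutation_importance(toy_predict, X, y)
for name, value in zip(["nN", "nsOH", "noise"], importances):
    print(f"{name}: mean accuracy drop = {value:.3f}")
```

Because `toy_predict` ignores nsOH and the noise column, shuffling them changes no prediction and their importance is exactly zero; only nN shows a drop.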
[Descriptor importance plots, remaining labels as extracted]
Plot 1: SHsOH, MLFER_A, maxHsOH, SHBd, maxHBd, nAcid, nsOH, minHsOH, minsOH, ATSC1e, SHBint2, maxHBint2, nN, minaasC, minHBint2, GATS2c, minHBd, maxsssN, minssCH2, GATS1s, nHBint2, minaaCH, ATSC4i, maxHBint4, SHBint4, minHBint4, SssNH, ATSC3i, ATSC2p
Plot 2: nN, SHsOH, maxHsOH, MDEO.11, nAtomLAC, nAcid, nBase, AATSC0i, minsOH, MLFER_A, ATSC2v, nAtomLC, nsOH, nssCH2, hmax, MDEO.12, ATSC2p, ATSC1e, ATSC1m, SssNH, nRotBt, minHsOH, Kier2, SsNH2, maxHBd, ATSC1i, minaasC, SaasC, minaaCH, AATS3e
-------
Model results
ESI+ Not Downsampled
              Reference
Prediction    Present    Absent
Present       2409       252
Absent        291        716
Sensitivity 0.8922
Specificity 0.7397
Balanced Accuracy 0.8159
ESI+ Downsampled
              Reference
Prediction    Present    Absent
Present       2273       388
Absent        171        836
Sensitivity 0.9300
Specificity 0.6830
Balanced Accuracy 0.8065
-------
Model results
ESI- Not Downsampled
              Reference
Prediction    Present    Absent
Present       1659       305
Absent        291        1413
Sensitivity 0.8508
Specificity 0.8225
Balanced Accuracy 0.8366
ESI- Downsampled
              Reference
Prediction    Present    Absent
Present       1649       315
Absent        271        1433
Sensitivity 0.8589
Specificity 0.8198
Balanced Accuracy 0.8393
-------
Internal test set results
ESI+ Not Downsampled
              Reference
Prediction    Present    Absent
Present       804        114
Absent        84         220
Sensitivity 0.9054
Specificity 0.6587
Balanced Accuracy 0.782
ESI+ Downsampled
              Reference
Prediction    Present    Absent
Present       767        65
Absent        121        269
Sensitivity 0.8637
Specificity 0.8054
Balanced Accuracy 0.8346
-------
Internal test set results
ESI- Not Downsampled
              Reference
Prediction    Present    Absent
Present       551        104
Absent        115        452
Sensitivity 0.8273
Specificity 0.8129
Balanced Accuracy 0.8201
ESI- Downsampled
              Reference
Prediction    Present    Absent
Present       545        92
Absent        121        464
Sensitivity 0.8183
Specificity 0.8345
Balanced Accuracy 0.8264
-------
Current & future work
Comparing model results to ENTACT results
-comparing predictions against independent labs, consensus of labs
Considering new metrics for model quality
-balanced accuracy not ideal when negative data may contain false negatives
Applicability domains for models under development
-global and local measures
Working with collaborators to improve available data
-data from additional potential collaborators would be GREATLY appreciated
-------
Contributing researchers
EPA ORD: Hussein Al-Ghoul*, Alex Chao*, Louis Groff*, Jarod Grossman*, Chris Grulke, Kristin Isaacs, Sarah Laughlin*, Jon Sobus, Kamel Mansouri*, James McCord, Andrew McEachran*, Jeff Minucci, Seth Newton, Katherine Phillips, Tom Purucker, Ann Richard, Randolph Singh*, Mark Strynar, Elin Ulrich, John Wambaugh, Antony Williams
GDIT: Ilya Balabin, Tom Transue, Tommy Cathey
* = ORISE/ORAU
Photo credit: the Research Triangle Foundation
-------
Thank you for listening!
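-------
The sensitivity, specificity, and balanced accuracy values in the result tables follow directly from the confusion-matrix counts. As a worked check, the Python sketch below recomputes them for the ESI+ not-downsampled model results, reading rows as predictions and columns as reference classes, with "Present" as the positive class. The function name `summarize` is illustrative only.

```python
def summarize(tp, fp, fn, tn):
    """Return (sensitivity, specificity, balanced accuracy) from 2x2 counts.

    tp: predicted Present, reference Present
    fp: predicted Present, reference Absent
    fn: predicted Absent,  reference Present
    tn: predicted Absent,  reference Absent
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Counts from the "Model results, ESI+ Not Downsampled" table.
sens, spec, bal_acc = summarize(tp=2409, fp=252, fn=291, tn=716)
print(f"Sensitivity       {sens:.4f}")  # 0.8922
print(f"Specificity       {spec:.4f}")  # 0.7397
print(f"Balanced Accuracy {bal_acc:.4f}")  # 0.8159
```

A chemical labeled "Absent" only because it was never tested falls into `fp` whenever the model predicts "Present", which lowers specificity. This is the reason balanced accuracy is flagged as not ideal under "Current & future work".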