DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: The METLIN small molecule dataset for machine learning-based retention time prediction

Abstract

Machine learning has been extensively applied in small molecule analysis to predict a wide range of molecular properties and processes including mass spectrometry fragmentation or chromatographic retention time. However, current approaches for retention time prediction lack sufficient accuracy due to limited available experimental data. Here we introduce the METLIN small molecule retention time (SMRT) dataset, an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules. To demonstrate the utility of this dataset, we deployed a deep learning model for retention time prediction applied to small molecule annotation. Results showed that in 70% of the cases, the correct molecular identity was ranked among the top 3 candidates based on their predicted retention time. We anticipate that this dataset will enable the community to apply machine learning or first principles strategies to generate better models for retention time prediction.
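The top-3 figure quoted in the abstract can be read as a ranking metric: for each molecule, candidate identities are ordered by how close their predicted retention time is to the experimentally observed one, and a case counts as a hit when the correct identity is among the first three. The following is a minimal sketch of that calculation, not the authors' model or evaluation code. The function names, the toy candidate sets, and the retention time values are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def rank_candidates(observed_rt: float, predicted_rt: Dict[str, float]) -> List[str]:
    """Order candidate identifiers by |predicted RT - observed RT|, closest first."""
    return sorted(predicted_rt, key=lambda name: abs(predicted_rt[name] - observed_rt))


def top_k_rate(cases: Sequence[Tuple[float, str, Dict[str, float]]], k: int = 3) -> float:
    """Fraction of cases whose true identity is among the k closest candidates."""
    hits = sum(
        1
        for observed, truth, preds in cases
        if truth in rank_candidates(observed, preds)[:k]
    )
    return hits / len(cases)


# Hypothetical toy data: (observed RT, true identity, predicted RT per candidate).
# Values are arbitrary and assumed to share one time unit.
cases = [
    (310.0, "A", {"A": 305.0, "B": 420.0, "C": 250.0, "D": 318.0}),
    (600.0, "X", {"X": 480.0, "Y": 590.0, "Z": 610.0, "W": 605.0}),
]

rate = top_k_rate(cases, k=3)
print(f"top-3 rate: {rate:.2f}")
```

In the first toy case the true identity has the smallest prediction error and is ranked first; in the second, three other candidates lie closer to the observed value, so the case is a miss. In practice the candidate sets would come from a mass-based database search, which is outside the scope of this sketch.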

Authors:
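Domingo-Almenara, Xavier [1]; Guijas, Carlos [2]; Billings, Elizabeth [2]; Montenegro-Burke, J. Rafael [2]; Uritboonthai, Winnie [2]; Aisporna, Aries E. [2]; Chen, Emily [3]; Benton, H. Paul [2]; Siuzdak, Gary [4]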
  1. The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics
  2. The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics
  3. The Scripps Research Inst., La Jolla, CA (United States). California Institute for Biomedical Research (Calibr)
  4. The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics; The Scripps Research Inst., La Jolla, CA (United States). Department of Integrative Structural and Computational Biology
Publication Date:
2019-12-20
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
OSTI Identifier:
1624228
Grant/Contract Number:  
AC02-05CH11231; R35GM130385; P30 MH062261; P01 DA026146; U01 CA235493
Resource Type:
Accepted Manuscript
Journal Name:
Nature Communications
Additional Journal Information:
Journal Volume: 10; Journal Issue: 1; Journal ID: ISSN 2041-1723
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; Science & Technology - Other Topics

Citation Formats

Domingo-Almenara, Xavier, Guijas, Carlos, Billings, Elizabeth, Montenegro-Burke, J. Rafael, Uritboonthai, Winnie, Aisporna, Aries E., Chen, Emily, Benton, H. Paul, and Siuzdak, Gary. The METLIN small molecule dataset for machine learning-based retention time prediction. United States: N. p., 2019. Web. doi:10.1038/s41467-019-13680-7.
Domingo-Almenara, Xavier, Guijas, Carlos, Billings, Elizabeth, Montenegro-Burke, J. Rafael, Uritboonthai, Winnie, Aisporna, Aries E., Chen, Emily, Benton, H. Paul, & Siuzdak, Gary. The METLIN small molecule dataset for machine learning-based retention time prediction. United States. https://doi.org/10.1038/s41467-019-13680-7
Domingo-Almenara, Xavier, Guijas, Carlos, Billings, Elizabeth, Montenegro-Burke, J. Rafael, Uritboonthai, Winnie, Aisporna, Aries E., Chen, Emily, Benton, H. Paul, and Siuzdak, Gary. 2019. "The METLIN small molecule dataset for machine learning-based retention time prediction". United States. https://doi.org/10.1038/s41467-019-13680-7. https://www.osti.gov/servlets/purl/1624228.
@article{osti_1624228,
title = {The METLIN small molecule dataset for machine learning-based retention time prediction},
author = {Domingo-Almenara, Xavier and Guijas, Carlos and Billings, Elizabeth and Montenegro-Burke, J. Rafael and Uritboonthai, Winnie and Aisporna, Aries E. and Chen, Emily and Benton, H. Paul and Siuzdak, Gary},
abstractNote = {Machine learning has been extensively applied in small molecule analysis to predict a wide range of molecular properties and processes including mass spectrometry fragmentation or chromatographic retention time. However, current approaches for retention time prediction lack sufficient accuracy due to limited available experimental data. Here we introduce the METLIN small molecule retention time (SMRT) dataset, an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules. To demonstrate the utility of this dataset, we deployed a deep learning model for retention time prediction applied to small molecule annotation. Results showed that in 70% of the cases, the correct molecular identity was ranked among the top 3 candidates based on their predicted retention time. We anticipate that this dataset will enable the community to apply machine learning or first principles strategies to generate better models for retention time prediction.},
doi = {10.1038/s41467-019-13680-7},
journal = {Nature Communications},
number = 1,
volume = 10,
place = {United States},
year = {2019},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 82 works
Citation information provided by
Web of Science

Figures / Tables:

Fig. 1: RT prediction results. a Composition of the SMRT dataset and structure of the deep-learning model. b Predicted vs experimental RT for the training set and c validation set. Non-retained molecules are indicated (tentatively) in the training set plot. The relative prediction error box plot for the validation set is also shown. The box plot represents median value and interquartile range (25–75% percentiles) excluding outliers.

Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.