The METLIN small molecule dataset for machine learning-based retention time prediction
Abstract
Machine learning has been extensively applied in small molecule analysis to predict a wide range of molecular properties and processes, including mass spectrometry fragmentation and chromatographic retention time. However, current approaches for retention time prediction lack sufficient accuracy due to limited available experimental data. Here we introduce the METLIN small molecule retention time (SMRT) dataset, an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules. To demonstrate the utility of this dataset, we deployed a deep learning model for retention time prediction applied to small molecule annotation. Results showed that in 70% of cases, the correct molecular identity was ranked among the top 3 candidates based on their predicted retention time. We anticipate that this dataset will enable the community to apply machine learning or first-principles strategies to generate better models for retention time prediction.
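The top-3 figure quoted in the abstract can be read as a ranking evaluation. The following is a minimal sketch of one way such a score might be computed, not the authors' implementation. It assumes candidates are ordered by the absolute difference between predicted and observed retention time. The function names, the candidate identifiers, and all retention times (in seconds) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# One annotation case: (true identity, observed RT, predicted RT per candidate).
Case = Tuple[str, float, Dict[str, float]]


def rank_candidates(observed_rt: float, predicted_rt: Dict[str, float]) -> List[str]:
    """Order candidate identifiers by |predicted RT - observed RT|, closest first."""
    return sorted(predicted_rt, key=lambda name: abs(predicted_rt[name] - observed_rt))


def top_k_rate(cases: List[Case], k: int = 3) -> float:
    """Fraction of cases whose true identity appears among the k closest candidates."""
    hits = 0
    for true_id, observed_rt, predicted in cases:
        if true_id in rank_candidates(observed_rt, predicted)[:k]:
            hits += 1
    return hits / len(cases)


# Made-up toy data (assumption): two cases with four candidates each.
toy_cases: List[Case] = [
    ("A", 300.0, {"A": 310.0, "B": 450.0, "C": 295.0, "D": 600.0}),
    ("X", 500.0, {"W": 505.0, "X": 900.0, "Y": 480.0, "Z": 520.0}),
]

if __name__ == "__main__":
    print(rank_candidates(300.0, toy_cases[0][2]))
    print(top_k_rate(toy_cases, k=3))
```

In this toy data the first case counts as a hit because its true identity is the second-closest candidate, while the second case is a miss because its true identity has the largest prediction error of the four.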
- Authors:
- Domingo-Almenara, Xavier; Guijas, Carlos; Billings, Elizabeth; Montenegro-Burke, J. Rafael; Uritboonthai, Winnie; Aisporna, Aries E.; Chen, Emily; Benton, H. Paul; Siuzdak, Gary
- The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics
- The Scripps Research Inst., La Jolla, CA (United States). California Institute for Biomedical Research (Calibr)
- The Scripps Research Inst., La Jolla, CA (United States). Department of Integrative Structural and Computational Biology
- Publication Date:
- December 2019
- Research Org.:
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Biological and Environmental Research (BER)
- OSTI Identifier:
- 1624228
- Grant/Contract Number:
- AC02-05CH11231; R35GM130385; P30 MH062261; P01 DA026146; U01 CA235493
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Nature Communications
- Additional Journal Information:
- Journal Volume: 10; Journal Issue: 1; Journal ID: ISSN 2041-1723
- Publisher:
- Nature Publishing Group
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; Science & Technology - Other Topics
Citation Formats
Domingo-Almenara, Xavier, Guijas, Carlos, Billings, Elizabeth, Montenegro-Burke, J. Rafael, Uritboonthai, Winnie, Aisporna, Aries E., Chen, Emily, Benton, H. Paul, and Siuzdak, Gary. The METLIN small molecule dataset for machine learning-based retention time prediction. United States: N. p., 2019. Web. doi:10.1038/s41467-019-13680-7.
Domingo-Almenara, Xavier, Guijas, Carlos, Billings, Elizabeth, Montenegro-Burke, J. Rafael, Uritboonthai, Winnie, Aisporna, Aries E., Chen, Emily, Benton, H. Paul, & Siuzdak, Gary. The METLIN small molecule dataset for machine learning-based retention time prediction. United States. https://doi.org/10.1038/s41467-019-13680-7
Domingo-Almenara, Xavier, Guijas, Carlos, Billings, Elizabeth, Montenegro-Burke, J. Rafael, Uritboonthai, Winnie, Aisporna, Aries E., Chen, Emily, Benton, H. Paul, and Siuzdak, Gary. 2019. "The METLIN small molecule dataset for machine learning-based retention time prediction". United States. https://doi.org/10.1038/s41467-019-13680-7. https://www.osti.gov/servlets/purl/1624228.
@article{osti_1624228,
title = {The METLIN small molecule dataset for machine learning-based retention time prediction},
author = {Domingo-Almenara, Xavier and Guijas, Carlos and Billings, Elizabeth and Montenegro-Burke, J. Rafael and Uritboonthai, Winnie and Aisporna, Aries E. and Chen, Emily and Benton, H. Paul and Siuzdak, Gary},
abstractNote = {Machine learning has been extensively applied in small molecule analysis to predict a wide range of molecular properties and processes including mass spectrometry fragmentation or chromatographic retention time. However, current approaches for retention time prediction lack sufficient accuracy due to limited available experimental data. Here we introduce the METLIN small molecule retention time (SMRT) dataset, an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules. To demonstrate the utility of this dataset, we deployed a deep learning model for retention time prediction applied to small molecule annotation. Results showed that in 70% of the cases, the correct molecular identity was ranked among the top 3 candidates based on their predicted retention time. We anticipate that this dataset will enable the community to apply machine learning or first principles strategies to generate better models for retention time prediction.},
doi = {10.1038/s41467-019-13680-7},
journal = {Nature Communications},
number = 1,
volume = 10,
place = {United States},
year = {2019},
month = {12}
}
Works referencing / citing this record:
Machine Learning Applications for Mass Spectrometry-Based Metabolomics
journal, June 2020
- Liebal, Ulf W.; Phan, An N. T.; Sudhakar, Malvika
- Metabolites, Vol. 10, Issue 6