DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: The METLIN small molecule dataset for machine learning-based retention time prediction

Journal Article · Nature Communications
[1]; [2]; [2]; [2]; [2]; [2]; [3]; [2]; [4]
  1. The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics
  2. The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics
  3. The Scripps Research Inst., La Jolla, CA (United States). California Institute for Biomedical Research (Calibr)
  4. The Scripps Research Inst., La Jolla, CA (United States). Scripps Center for Metabolomics; The Scripps Research Inst., La Jolla, CA (United States). Department of Integrative Structural and Computational Biology

Machine learning has been extensively applied in small molecule analysis to predict a wide range of molecular properties and processes including mass spectrometry fragmentation or chromatographic retention time. However, current approaches for retention time prediction lack sufficient accuracy due to limited available experimental data. Here we introduce the METLIN small molecule retention time (SMRT) dataset, an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules. To demonstrate the utility of this dataset, we deployed a deep learning model for retention time prediction applied to small molecule annotation. Results showed that in 70% of the cases, the correct molecular identity was ranked among the top 3 candidates based on their predicted retention time. We anticipate that this dataset will enable the community to apply machine learning or first principles strategies to generate better models for retention time prediction.
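
The top-3 figure quoted in the abstract can be read as a ranking metric: for each unknown, candidate structures are ordered by how close their predicted retention time is to the observed one, and a case counts as a hit when the true structure falls within the first three. The following is a minimal sketch of that evaluation, not the authors' deep learning model or their exact scoring procedure. The function names, candidate IDs, and retention times are hypothetical, and ranking by absolute error is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

# One evaluation case: (observed RT, {candidate ID: predicted RT}, true candidate ID).
Case = Tuple[float, Dict[str, float], str]


def rank_candidates(observed_rt: float, predicted_rt: Dict[str, float]) -> List[str]:
    """Order candidate IDs by |predicted RT - observed RT|, closest first."""
    return sorted(predicted_rt, key=lambda cid: abs(predicted_rt[cid] - observed_rt))


def top_k_rate(cases: List[Case], k: int = 3) -> float:
    """Fraction of cases whose true candidate is among the k closest predictions."""
    hits = 0
    for observed_rt, predicted_rt, true_id in cases:
        if true_id in rank_candidates(observed_rt, predicted_rt)[:k]:
            hits += 1
    return hits / len(cases)


# Hypothetical toy data in seconds; these values are NOT from the SMRT dataset.
cases: List[Case] = [
    (300.0, {"A": 310.0, "B": 450.0, "C": 295.0, "D": 600.0}, "A"),
    (520.0, {"E": 200.0, "F": 530.0, "G": 515.0, "H": 525.0, "I": 900.0}, "E"),
]

if __name__ == "__main__":
    print(rank_candidates(*cases[0][:2]))
    print(top_k_rate(cases, k=3))
```

In the toy data the first case is a hit and the second is a miss, so the top-3 rate is 0.5. Candidates with equal error keep their input order because Python's sort is stable; a real evaluation would need an explicit tie-breaking rule.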

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER)
Grant/Contract Number:
AC02-05CH11231; R35GM130385; P30 MH062261; P01 DA026146; U01 CA235493
OSTI ID:
1624228
Journal Information:
Nature Communications, Vol. 10, Issue 1; ISSN 2041-1723
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 91 works
Citation information provided by
Web of Science

