DOE PAGES: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns

Abstract

Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA’s DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA’s CompTox Chemicals Dashboard.

Authors:
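McEachran, Andrew D. [1]; Balabin, Ilya [2]; Cathey, Tommy [3]; Transue, Thomas R. [3]; Al-Ghoul, Hussein [4]; Grulke, Chris [5]; Sobus, Jon R. [6]; Williams, Antony J. [5]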
  1. Oak Ridge Inst. for Science and Education (ORISE), Durham, NC (United States). Environmental Protection Agency; US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Center for Computational Toxicology
  2. CSRA Inc., Research Triangle Park, Durham, NC (United States)
  3. GDIT, Research Triangle Park, Durham, NC (United States)
  4. Oak Ridge Associated Univ., Durham, NC (United States)
  5. US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Center for Computational Toxicology
  6. US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Exposure Research Lab.
Publication Date:
2019-08-02
Research Org.:
Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1624275
Grant/Contract Number:  
SC0014664
Resource Type:
Accepted Manuscript
Journal Name:
Scientific Data
Additional Journal Information:
Journal Volume: 6; Journal Issue: 1; Journal ID: ISSN 2052-4463
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; 32 ENERGY CONSERVATION, CONSUMPTION, AND UTILIZATION; Science & Technology - Other Topics

Citation Formats

McEachran, Andrew D., Balabin, Ilya, Cathey, Tommy, Transue, Thomas R., Al-Ghoul, Hussein, Grulke, Chris, Sobus, Jon R., and Williams, Antony J. Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns. United States: N. p., 2019. Web. doi:10.1038/s41597-019-0145-z.
McEachran, Andrew D., Balabin, Ilya, Cathey, Tommy, Transue, Thomas R., Al-Ghoul, Hussein, Grulke, Chris, Sobus, Jon R., & Williams, Antony J. (2019). Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns. United States. https://doi.org/10.1038/s41597-019-0145-z
McEachran, Andrew D., Balabin, Ilya, Cathey, Tommy, Transue, Thomas R., Al-Ghoul, Hussein, Grulke, Chris, Sobus, Jon R., and Williams, Antony J. 2019. "Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns". United States. https://doi.org/10.1038/s41597-019-0145-z. https://www.osti.gov/servlets/purl/1624275.
@article{osti_1624275,
title = {Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns},
author = {McEachran, Andrew D. and Balabin, Ilya and Cathey, Tommy and Transue, Thomas R. and Al-Ghoul, Hussein and Grulke, Chris and Sobus, Jon R. and Williams, Antony J.},
abstractNote = {Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA’s DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA’s CompTox Chemicals Dashboard.},
doi = {10.1038/s41597-019-0145-z},
journal = {Scientific Data},
number = 1,
volume = 6,
place = {United States},
year = {2019},
month = aug
}

Works referenced in this record:

Integrating tools for non-targeted analysis research and chemical safety evaluations at the US EPA
journal, December 2017

  • Sobus, Jon R.; Wambaugh, John F.; Isaacs, Kristin K.
  • Journal of Exposure Science & Environmental Epidemiology, Vol. 28, Issue 5
  • DOI: 10.1038/s41370-017-0012-y

Nontarget Screening with High Resolution Mass Spectrometry in the Environment: Ready to Go?
journal, September 2017

  • Hollender, Juliane; Schymanski, Emma L.; Singer, Heinz P.
  • Environmental Science & Technology, Vol. 51, Issue 20
  • DOI: 10.1021/acs.est.7b02184

Exposome-Scale Investigations Guided by Global Metabolomics, Pathway Analysis, and Cognitive Computing
journal, October 2017


Open Science for Identifying “Known Unknown” Chemicals
journal, April 2017

  • Schymanski, Emma L.; Williams, Antony J.
  • Environmental Science & Technology, Vol. 51, Issue 10
  • DOI: 10.1021/acs.est.7b01908

Critical Assessment of Small Molecule Identification 2016: automated methods
journal, March 2017

  • Schymanski, Emma L.; Ruttkies, Christoph; Krauss, Martin
  • Journal of Cheminformatics, Vol. 9, Issue 1
  • DOI: 10.1186/s13321-017-0207-1

Identifying known unknowns using the US EPA’s CompTox Chemistry Dashboard
journal, December 2016

  • McEachran, Andrew D.; Sobus, Jon R.; Williams, Antony J.
  • Analytical and Bioanalytical Chemistry, Vol. 409, Issue 7
  • DOI: 10.1007/s00216-016-0139-z

MetFrag relaunched: incorporating strategies beyond in silico fragmentation
journal, January 2016

  • Ruttkies, Christoph; Schymanski, Emma L.; Wolf, Sebastian
  • Journal of Cheminformatics, Vol. 8, Issue 1
  • DOI: 10.1186/s13321-016-0115-9

Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification
journal, June 2014


Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93% accuracy
journal, May 2017

  • Blaženović, Ivana; Kind, Tobias; Torbašinović, Hrvoje
  • Journal of Cheminformatics, Vol. 9, Issue 1
  • DOI: 10.1186/s13321-017-0219-x

Mass spectral databases for LC/MS- and GC/MS-based metabolomics: State of the field and future prospects
journal, April 2016

  • Vinaixa, Maria; Schymanski, Emma L.; Neumann, Steffen
  • TrAC Trends in Analytical Chemistry, Vol. 78
  • DOI: 10.1016/j.trac.2015.09.005

MassBank: a public repository for sharing mass spectral data for life sciences
journal, July 2010

  • Horai, Hisayuki; Arita, Masanori; Kanaya, Shigehiko
  • Journal of Mass Spectrometry, Vol. 45, Issue 7
  • DOI: 10.1002/jms.1777

METLIN: A Metabolite Mass Spectral Database
journal, January 2005


Using prepared mixtures of ToxCast chemicals to evaluate non-targeted analysis (NTA) method performance
journal, January 2019

  • Sobus, Jon R.; Grossman, Jarod N.; Chao, Alex
  • Analytical and Bioanalytical Chemistry, Vol. 411, Issue 4
  • DOI: 10.1007/s00216-018-1526-4

Searching molecular structure databases with tandem mass spectra using CSI:FingerID
journal, September 2015

  • Dührkop, Kai; Shen, Huibin; Meusel, Marvin
  • Proceedings of the National Academy of Sciences, Vol. 112, Issue 41
  • DOI: 10.1073/pnas.1509788112

The CompTox Chemistry Dashboard: a community data resource for environmental chemistry
journal, November 2017

  • Williams, Antony J.; Grulke, Christopher M.; Edwards, Jeff
  • Journal of Cheminformatics, Vol. 9, Issue 1
  • DOI: 10.1186/s13321-017-0247-6

CFM-ID: a web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra
journal, June 2014

  • Allen, Felicity; Pon, Allison; Wilson, Michael
  • Nucleic Acids Research, Vol. 42, Issue W1
  • DOI: 10.1093/nar/gku436

Computational Prediction of Electron Ionization Mass Spectra to Assist in GC/MS Compound Identification
journal, July 2016


EPA’s non-targeted analysis collaborative trial (ENTACT): genesis, design, and initial findings
journal, December 2018

  • Ulrich, Elin M.; Sobus, Jon R.; Grulke, Christopher M.
  • Analytical and Bioanalytical Chemistry, Vol. 411, Issue 4
  • DOI: 10.1007/s00216-018-1435-6

“MS-Ready” structures for non-targeted high-resolution mass spectrometry screening studies
journal, August 2018

  • McEachran, Andrew D.; Mansouri, Kamel; Grulke, Chris
  • Journal of Cheminformatics, Vol. 10, Issue 1
  • DOI: 10.1186/s13321-018-0299-2

CFM-ID Paper Data
dataset, January 2019

  • EPA's National Center for Computational Toxicology
  • The United States Environmental Protection Agency’s Center for Computational Toxicology and Exposure
  • DOI: 10.23645/epacomptox.7776212.v1

The Chemical and Products Database, a resource for exposure-relevant data on chemicals in consumer products
journal, July 2018

  • Dionisio, Kathie L.; Phillips, Katherine; Price, Paul S.
  • Scientific Data, Vol. 5, Issue 1
  • DOI: 10.1038/sdata.2018.125

Optimization and testing of mass spectral library search algorithms for compound identification
journal, September 1994

  • Stein, Stephen E.; Scott, Donald R.
  • Journal of the American Society for Mass Spectrometry, Vol. 5, Issue 9
  • DOI: 10.1016/1044-0305(94)87009-8

Data Structures for Statistical Computing in Python
conference, January 2010


S0 | SUSDAT | Merged NORMAN Suspect List: SusDat
dataset, January 2020


ToxCast Chemical Landscape: Paving the Road to 21st Century Toxicology
journal, July 2016


Comparative analysis of mass spectral matching-based compound identification in gas chromatography–mass spectrometry
journal, July 2013


Identification of “Known Unknowns” Utilizing Accurate Mass Data and ChemSpider
journal, November 2011

  • Little, James L.; Williams, Antony J.; Pshenichnov, Alexey
  • Journal of The American Society for Mass Spectrometry, Vol. 23, Issue 1
  • DOI: 10.1007/s13361-011-0265-y

CFM-ID Paper Data
dataset, January 2019

  • EPA's National Center for Computational Toxicology
  • The United States Environmental Protection Agency’s Center for Computational Toxicology and Exposure
  • DOI: 10.23645/epacomptox.7776212

S0 | SUSDAT | Merged NORMAN Suspect List: SusDat
dataset, January 2021


S0 | SUSDAT | Merged NORMAN Suspect List: SusDat
dataset, January 2022


Works referencing / citing this record:

In silico MS/MS spectra for identifying unknowns: a critical examination using CFM-ID algorithms and ENTACT mixture samples
journal, January 2020

  • Chao, Alex; Al-Ghoul, Hussein; McEachran, Andrew D.
  • Analytical and Bioanalytical Chemistry, Vol. 412, Issue 6
  • DOI: 10.1007/s00216-019-02351-7

CFM-ID 3.0: Significantly Improved ESI-MS/MS Prediction and Compound Identification
journal, April 2019

  • Djoumbou-Feunang, Yannick; Pon, Allison; Karu, Naama
  • Metabolites, Vol. 9, Issue 4
  • DOI: 10.3390/metabo9040072