OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns

Journal Article · Scientific Data
[1]; [2]; [3]; [3]; [4]; [5]; [6]; [5]
  1. Oak Ridge Inst. for Science and Education (ORISE), Durham, NC (United States). Environmental Protection Agency; US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Center for Computational Toxicology
  2. CSRA Inc., Research Triangle Park, Durham, NC (United States)
  3. GDIT, Research Triangle Park, Durham, NC (United States)
  4. Oak Ridge Associated Univ., Durham, NC (United States)
  5. US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Center for Computational Toxicology
  6. US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Exposure Research Lab.

Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA’s DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA’s CompTox Chemicals Dashboard.
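As a rough illustration of how predicted MS/MS spectra might support identification of an unknown, the sketch below filters candidate structures by precursor mass and then ranks them by a simple cosine similarity between observed and predicted fragment peaks. It is a minimal sketch under stated assumptions, not the scoring used by CFM-ID or the CompTox Chemicals Dashboard. The record layout (`id`, `mass`, `peaks`), the function names, the tolerances, and all numeric values are hypothetical and do not reflect the schema or contents of the released SQL database or .csv export.

```python
import math

def ppm_error(observed_mass, candidate_mass):
    """Absolute mass error in parts per million."""
    return abs(observed_mass - candidate_mass) / candidate_mass * 1e6

def cosine_score(observed, predicted, mz_tol=0.01):
    """Cosine similarity between two peak lists of (m/z, intensity) pairs.

    Each observed peak is paired with the closest unused predicted peak
    within mz_tol; unmatched peaks contribute only to the norms.
    """
    used = set()
    dot = 0.0
    for mz_o, int_o in observed:
        best = None  # (m/z difference, predicted index, predicted intensity)
        for j, (mz_p, int_p) in enumerate(predicted):
            if j in used:
                continue
            diff = abs(mz_o - mz_p)
            if diff <= mz_tol and (best is None or diff < best[0]):
                best = (diff, j, int_p)
        if best is not None:
            used.add(best[1])
            dot += int_o * best[2]
    norm_o = math.sqrt(sum(i * i for _, i in observed))
    norm_p = math.sqrt(sum(i * i for _, i in predicted))
    if norm_o == 0 or norm_p == 0:
        return 0.0
    return dot / (norm_o * norm_p)

def rank_candidates(observed_mass, observed_peaks, library, ppm_tol=10.0):
    """Keep candidates within ppm_tol of the observed mass, best score first."""
    hits = []
    for entry in library:
        if ppm_error(observed_mass, entry["mass"]) <= ppm_tol:
            score = cosine_score(observed_peaks, entry["peaks"])
            hits.append((score, entry["id"]))
    return sorted(hits, reverse=True)

# Toy library with invented values (not real chemicals or real predictions).
library = [
    {"id": "candidate_A", "mass": 200.0000, "peaks": [(50.0, 100.0), (100.0, 50.0)]},
    {"id": "candidate_B", "mass": 200.0010, "peaks": [(60.0, 100.0), (100.0, 50.0)]},
    {"id": "candidate_C", "mass": 250.0000, "peaks": [(50.0, 100.0), (100.0, 50.0)]},
]

ranked = rank_candidates(200.0005, [(50.001, 100.0), (100.002, 50.0)], library)
for score, candidate_id in ranked:
    print(f"{candidate_id}: {score:.3f}")
```

In this toy case, candidate_C is excluded by the mass filter, and candidate_A outranks candidate_B because both of its predicted fragments match the observed peaks. In practice the mass filter would use adduct-corrected masses, and the scoring function would be chosen to suit the instrument and collision energies.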

Research Organization:
Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
SC0014664
OSTI ID:
1624275
Journal Information:
Scientific Data, Vol. 6, Issue 1; ISSN 2052-4463
Publisher:
Nature Publishing Group
Country of Publication:
United States
Language:
English
