Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns
Abstract
Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA’s DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA’s CompTox Chemicals Dashboard.
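- Authors:
- McEachran, Andrew D.; Balabin, Ilya; Cathey, Tommy; Transue, Thomas R.; Al-Ghoul, Hussein; Grulke, Chris; Sobus, Jon R.; Williams, Antony J.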
- Oak Ridge Inst. for Science and Education (ORISE), Durham, NC (United States). Environmental Protection Agency; US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Center for Computational Toxicology
- CSRA Inc., Research Triangle Park, Durham, NC (United States)
- GDIT, Research Triangle Park, Durham, NC (United States)
- Oak Ridge Associated Univ., Durham, NC (United States)
- US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Center for Computational Toxicology
- US Environmental Protection Agency (EPA), Research Triangle Park, NC (United States). Office of Research and Development. National Exposure Research Lab.
- Publication Date:
- 2019-08-02
- Research Org.:
- Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1624275
- Grant/Contract Number:
- SC0014664
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Scientific Data
- Additional Journal Information:
- Journal Volume: 6; Journal Issue: 1; Journal ID: ISSN 2052-4463
- Publisher:
- Nature Publishing Group
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; 32 ENERGY CONSERVATION, CONSUMPTION, AND UTILIZATION; Science & Technology - Other Topics
Citation Formats
McEachran, Andrew D., Balabin, Ilya, Cathey, Tommy, Transue, Thomas R., Al-Ghoul, Hussein, Grulke, Chris, Sobus, Jon R., and Williams, Antony J. Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns. United States: N. p., 2019. Web. doi:10.1038/s41597-019-0145-z.
McEachran, Andrew D., Balabin, Ilya, Cathey, Tommy, Transue, Thomas R., Al-Ghoul, Hussein, Grulke, Chris, Sobus, Jon R., & Williams, Antony J. Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns. United States. https://doi.org/10.1038/s41597-019-0145-z
McEachran, Andrew D., Balabin, Ilya, Cathey, Tommy, Transue, Thomas R., Al-Ghoul, Hussein, Grulke, Chris, Sobus, Jon R., and Williams, Antony J. 2019. "Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns". United States. https://doi.org/10.1038/s41597-019-0145-z. https://www.osti.gov/servlets/purl/1624275.
@article{osti_1624275,
title = {Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns},
author = {McEachran, Andrew D. and Balabin, Ilya and Cathey, Tommy and Transue, Thomas R. and Al-Ghoul, Hussein and Grulke, Chris and Sobus, Jon R. and Williams, Antony J.},
abstractNote = {Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA’s DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA’s CompTox Chemicals Dashboard.},
doi = {10.1038/s41597-019-0145-z},
journal = {Scientific Data},
number = 1,
volume = 6,
place = {United States},
year = {2019},
month = {8}
}
Works referencing / citing this record:
In silico MS/MS spectra for identifying unknowns: a critical examination using CFM-ID algorithms and ENTACT mixture samples
journal, January 2020
- Chao, Alex; Al-Ghoul, Hussein; McEachran, Andrew D.
- Analytical and Bioanalytical Chemistry, Vol. 412, Issue 6
CFM-ID 3.0: Significantly Improved ESI-MS/MS Prediction and Compound Identification
journal, April 2019
- Djoumbou-Feunang, Yannick; Pon, Allison; Karu, Naama
- Metabolites, Vol. 9, Issue 4