OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Robust predictions of specialized metabolism genes through machine learning

Abstract

Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
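
The true positive and true negative rates quoted in the abstract are standard confusion-matrix ratios. The sketch below is only an illustration of how such rates are computed; it is not the authors' code, and the counts passed in are hypothetical values chosen solely so that the ratios match the reported 87% and 71%.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Fraction of actual SM genes that the classifier labels as SM."""
    return tp / (tp + fn)


def true_negative_rate(tn: int, fp: int) -> float:
    """Fraction of actual GM genes that the classifier labels as GM."""
    return tn / (tn + fp)


# Hypothetical counts (not from the paper), picked only to reproduce
# the rates stated in the abstract.
tpr = true_positive_rate(tp=87, fn=13)
tnr = true_negative_rate(tn=71, fp=29)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")
```

A true negative rate of 71% means that, under this definition, 29% of GM genes would be called SM, which is why the abstract reports both rates rather than a single accuracy figure.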

Authors:
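Moore, Bethany M.; Wang, Peipei; Fan, Pengxiang; Leong, Bryan; Schenck, Craig A.; Lloyd, John P.; Lehti-Shiu, Melissa D.; Last, Robert L.; Pichersky, Eran; Shiu, Shin-Han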
Publication Date:
January 2019
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)
OSTI Identifier:
1491911
Grant/Contract Number:  
SC0018409
Resource Type:
Journal Article: Published Article
Journal Name:
Proceedings of the National Academy of Sciences of the United States of America
Additional Journal Information:
Journal Volume: 116; Journal Issue: 6; Journal ID: ISSN 0027-8424
Publisher:
Proceedings of the National Academy of Sciences
Country of Publication:
United States
Language:
English

Citation Formats

Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, and Shiu, Shin-Han. Robust predictions of specialized metabolism genes through machine learning. United States: N. p., 2019. Web. doi:10.1073/pnas.1817074116.
Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, & Shiu, Shin-Han. Robust predictions of specialized metabolism genes through machine learning. United States. doi:10.1073/pnas.1817074116.
Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, and Shiu, Shin-Han. 2019. "Robust predictions of specialized metabolism genes through machine learning". United States. doi:10.1073/pnas.1817074116.
@article{osti_1491911,
title = {Robust predictions of specialized metabolism genes through machine learning},
author = {Moore, Bethany M. and Wang, Peipei and Fan, Pengxiang and Leong, Bryan and Schenck, Craig A. and Lloyd, John P. and Lehti-Shiu, Melissa D. and Last, Robert L. and Pichersky, Eran and Shiu, Shin-Han},
abstractNote = {Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.},
doi = {10.1073/pnas.1817074116},
journal = {Proceedings of the National Academy of Sciences of the United States of America},
issn = {0027-8424},
number = 6,
volume = 116,
place = {United States},
year = {2019},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record at 10.1073/pnas.1817074116

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science
