Title: Robust predictions of specialized metabolism genes through machine learning

Abstract

Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
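
The true positive and true negative rates quoted in the abstract are standard binary-classification measures. As a minimal sketch of how such rates are computed, assuming SM is treated as the positive class and GM as the negative class, the snippet below counts outcomes on a hypothetical toy label set. The function name `rates` and the example labels are illustrative assumptions; they are not the authors' code or data.

```python
def rates(y_true, y_pred, positive="SM"):
    """Return (true positive rate, true negative rate) for binary labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    # TPR = TP / (TP + FN): fraction of known SM genes predicted as SM.
    # TNR = TN / (TN + FP): fraction of known GM genes predicted as GM.
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels invented for this example (assumption, not data from the study).
y_true = ["SM", "SM", "SM", "SM", "GM", "GM", "GM", "GM"]
y_pred = ["SM", "SM", "SM", "GM", "GM", "GM", "GM", "SM"]
tpr, tnr = rates(y_true, y_pred)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

Reporting the two rates separately shows how well each class is recovered on its own, which a single accuracy figure would hide when the classes differ in size.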

Authors:
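Moore, Bethany M.; Wang, Peipei; Fan, Pengxiang; Leong, Bryan; Schenck, Craig A.; Lloyd, John P.; Lehti-Shiu, Melissa D.; Last, Robert L.; Pichersky, Eran; Shiu, Shin-Han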
Publication Date:
January 2019
Research Org.:
Univ. of Wisconsin, Madison, WI (United States); USDOE Bioenergy Research Centers (BRC) (United States). Great Lakes Bioenergy Research Center (GLBRC)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER); National Science Foundation (NSF)
OSTI Identifier:
1491911
Alternate Identifier(s):
OSTI ID: 1612975
Grant/Contract Number:  
SC0018409; IOS-1546617; NSF DEB-1655386
Resource Type:
Journal Article: Published Article
Journal Name:
Proceedings of the National Academy of Sciences of the United States of America
Additional Journal Information:
Journal Name: Proceedings of the National Academy of Sciences of the United States of America; Journal Volume: 116; Journal Issue: 6; Journal ID: ISSN 0027-8424
Publisher:
National Academy of Sciences
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; science & technology - other topics; specialized metabolism; machine learning; predictive biology; data integration

Citation Formats

Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, and Shiu, Shin-Han. Robust predictions of specialized metabolism genes through machine learning. United States: N. p., 2019. Web. doi:10.1073/pnas.1817074116.
Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, & Shiu, Shin-Han. Robust predictions of specialized metabolism genes through machine learning. United States. doi:10.1073/pnas.1817074116.
Moore, Bethany M., Wang, Peipei, Fan, Pengxiang, Leong, Bryan, Schenck, Craig A., Lloyd, John P., Lehti-Shiu, Melissa D., Last, Robert L., Pichersky, Eran, and Shiu, Shin-Han. 2019. "Robust predictions of specialized metabolism genes through machine learning". United States. doi:10.1073/pnas.1817074116.
@article{osti_1491911,
title = {Robust predictions of specialized metabolism genes through machine learning},
author = {Moore, Bethany M. and Wang, Peipei and Fan, Pengxiang and Leong, Bryan and Schenck, Craig A. and Lloyd, John P. and Lehti-Shiu, Melissa D. and Last, Robert L. and Pichersky, Eran and Shiu, Shin-Han},
abstractNote = {Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.},
doi = {10.1073/pnas.1817074116},
journal = {Proceedings of the National Academy of Sciences of the United States of America},
issn = {0027-8424},
number = 6,
volume = 116,
place = {United States},
year = {2019},
month = {1}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record at 10.1073/pnas.1817074116

Citation Metrics:
Cited by: 5 works
Citation information provided by
Web of Science
