Title: Robust predictions of specialized metabolism genes through machine learning

Journal Article · Proceedings of the National Academy of Sciences of the United States of America

Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that, relative to GM genes, SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks. Although the values of each of these features differed significantly between SM and GM genes, no single feature was effective at distinguishing SM from GM genes. By integrating all features with machine learning methods, we established a prediction model with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted as SM genes. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
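The true positive and true negative rates quoted in the abstract can be illustrated with a minimal sketch. The function name `tpr_tnr`, the label encoding (1 = SM, 0 = GM), and the toy labels below are assumptions made for illustration only; they are not the study's code or data, and the abstract does not state which learning algorithm was used.

```python
from typing import Sequence, Tuple


def tpr_tnr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate).

    Assumed encoding (hypothetical): 1 = SM gene, 0 = GM gene.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # SM predicted as SM
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # SM predicted as GM
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # GM predicted as GM
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # GM predicted as SM
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one SM and one GM gene in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels, invented purely to exercise the function.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0]
tpr, tnr = tpr_tnr(toy_true, toy_pred)
print(f"TPR={tpr:.2f} TNR={tnr:.2f}")
```

Under this reading, a true positive rate of 87% means that 87% of benchmark SM genes were predicted as SM, and a true negative rate of 71% means that 71% of benchmark GM genes were predicted as GM.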

Research Organization:
Univ. of Wisconsin, Madison, WI (United States); USDOE Bioenergy Research Centers (BRC) (United States). Great Lakes Bioenergy Research Center (GLBRC)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER); National Science Foundation (NSF)
Grant/Contract Number:
SC0018409; IOS-1546617; NSF DEB-1655386
OSTI ID:
1491911
Alternate ID(s):
OSTI ID: 1612975
Journal Information:
Proceedings of the National Academy of Sciences of the United States of America, Vol. 116, Issue 6; ISSN 0027-8424
Publisher:
National Academy of Sciences
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 56 works (citation information provided by Web of Science)
