U.S. Department of Energy
Office of Scientific and Technical Information

Predicting variable gene content in Escherichia coli using conserved genes

Journal Article · mSystems
  1. Argonne National Laboratory (ANL), Argonne, IL (United States); Univ. of Chicago, IL (United States)
  2. Hope College, Holland, MI (United States)
  3. Univ. of Chicago, IL (United States); Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States)
Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%–90% of all E. coli genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943–0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including “hypothetical proteins,” was accurately predicted (F1 = 0.902 [0.898–0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1s = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876–0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.
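The workflow summarized in the abstract (k-mer features from conserved genes, one classifier per protein family found in 10%–90% of genomes, scoring by macro F1) can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the choice of k, the inclusive treatment of the 10% and 90% bounds, and the definition of macro F1 as the unweighted mean of the F1 scores for the "absent" and "present" classes are all assumptions made for illustration. No classifier is trained here; the study itself used extreme gradient boosting models.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def kmer_counts(sequence, k):
    """Count overlapping nucleotide k-mers, skipping windows with non-ACGT characters."""
    counts = Counter()
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if all(base in NUCLEOTIDES for base in kmer):
            counts[kmer] += 1
    return counts


def feature_vector(conserved_genes, k):
    """Sum k-mer counts over a genome's conserved genes into one fixed-order vector.

    Each gene is counted separately, so no k-mer spans two genes.
    The vector follows the lexicographic order of all 4**k possible k-mers.
    """
    total = Counter()
    for gene in conserved_genes:
        total.update(kmer_counts(gene, k))
    vocabulary = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    return [total[kmer] for kmer in vocabulary]


def variable_families(presence, low=0.10, high=0.90):
    """Return protein families present in a fraction of genomes between low and high.

    presence maps a genome ID to the set of protein family IDs it contains.
    Treating both bounds as inclusive is an assumption of this sketch.
    """
    n_genomes = len(presence)
    frequency = Counter(f for families in presence.values() for f in families)
    return sorted(f for f, c in frequency.items() if low <= c / n_genomes <= high)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for class 0 (absent) and class 1 (present)."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

In this sketch, each genome would contribute one feature vector, each family returned by variable_families would get its own binary classifier, and macro_f1 would be applied to one genome's true and predicted presence/absence labels across all families before averaging over genomes.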
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
Defense Advanced Research Projects Agency (DARPA); National Institutes of Health (NIH); National Science Foundation (NSF); USDOE
Grant/Contract Number:
AC02-06CH11357
OSTI ID:
2324772
Journal Information:
mSystems, Vol. 8, Issue 4; ISSN 2379-5077
Publisher:
American Society for Microbiology
Country of Publication:
United States
Language:
English
