Predicting variable gene content in Escherichia coli using conserved genes
- Argonne National Laboratory (ANL), Argonne, IL (United States); Univ. of Chicago, IL (United States)
- Hope College, Holland, MI (United States)
- Univ. of Chicago, IL (United States); Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States)
Having the ability to predict the protein-encoding gene content of an incomplete genome or metagenome-assembled genome is important for a variety of bioinformatic tasks. In this study, as a proof of concept, we built machine learning classifiers for predicting variable gene content in Escherichia coli genomes using only the nucleotide k-mers from a set of 100 conserved genes as features. Protein families were used to define orthologs, and a single classifier was built for predicting the presence or absence of each protein family occurring in 10%–90% of all E. coli genomes. The resulting set of 3,259 extreme gradient boosting classifiers had a per-genome average macro F1 score of 0.944 [0.943–0.945, 95% CI]. We show that the F1 scores are stable across multi-locus sequence types and that the trend can be recapitulated by sampling a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including “hypothetical proteins,” was accurately predicted (F1 = 0.902 [0.898–0.906, 95% CI]). Models for proteins with horizontal gene transfer-related functions had slightly lower F1 scores but were still accurate (F1s = 0.895, 0.872, 0.824, and 0.841 for transposon, phage, plasmid, and antimicrobial resistance-related functions, respectively). Finally, using a holdout set of 419 diverse E. coli genomes that were isolated from freshwater environmental sources, we observed an average per-genome F1 score of 0.880 [0.876–0.883, 95% CI], demonstrating the extensibility of the models. Overall, this study provides a framework for predicting variable gene content using a limited amount of input sequence data.
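The abstract names three ingredients: nucleotide k-mer features, a 10%–90% prevalence filter on protein families, and macro F1 scoring. The following is a minimal, standard-library sketch of those pieces only, not the authors' pipeline. The function names, the inclusive treatment of the 10% and 90% bounds, the choice of k, and the toy data are all assumptions made for illustration; the classifier training step (extreme gradient boosting in the study) is deliberately left out.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count overlapping nucleotide k-mers, skipping any with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(seq, k):
    """Fixed-order count vector over all 4**k possible k-mers (hypothetical layout)."""
    counts = kmer_counts(seq, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def variable_families(presence, low=0.10, high=0.90):
    """Return families present in a fraction of genomes within [low, high].

    `presence` maps a genome ID to the set of protein family IDs it contains.
    Inclusive bounds are an assumption; the abstract only says "10%-90%".
    """
    n = len(presence)
    families = set().union(*presence.values())
    kept = []
    for fam in sorted(families):
        frac = sum(fam in fams for fams in presence.values()) / n
        if low <= frac <= high:
            kept.append(fam)
    return kept


def f1(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores for the present (1) and absent (0) classes."""
    return (f1(y_true, y_pred, 1) + f1(y_true, y_pred, 0)) / 2


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_presence = {
        "g1": {"famA", "famB"},
        "g2": {"famA"},
        "g3": {"famA", "famC"},
        "g4": {"famA", "famB"},
    }
    print(variable_families(toy_presence))
    print(feature_vector("ACGTAC", 2))
    print(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

In this sketch, each retained family would get its own binary classifier trained on the k-mer vectors of the conserved genes, with one presence/absence label per genome. Macro F1 is used here because it weights the present and absent classes equally, which matters when a family sits near the 10% or 90% end of the prevalence range.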
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States)
- Sponsoring Organization:
- Defense Advanced Research Projects Agency (DARPA); National Institutes of Health (NIH); National Science Foundation (NSF); USDOE
- Grant/Contract Number:
- AC02-06CH11357
- OSTI ID:
- 2324772
- Journal Information:
- Journal Name: mSystems; Journal Volume: 8; Journal Issue: 4; ISSN 2379-5077
- Publisher:
- American Society for Microbiology
- Country of Publication:
- United States
- Language:
- English
Similar Records
- Predicting antimicrobial resistance using conserved genes · Journal Article · Sun Oct 18 20:00:00 EDT 2020 · PLoS Computational Biology (Online) · OSTI ID: 1757970