A machine learning-based service for estimating quality of genomes using PATRIC
Abstract
Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel. We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies. EvalG and EvalCon facilitate the rapid quality control and exploration of PATRIC-annotated draft genomes.
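The abstract does not spell out how EvalG computes its estimates, so the following is only a minimal sketch of the general marker-counting idea behind CheckM-style tools, not the authors' implementation. It assumes a hypothetical set of roles expected exactly once per genome (`marker_roles`) and a list of roles annotated in a draft genome (`observed_roles`); the function name and the example role names are illustrative assumptions.

```python
from collections import Counter


def estimate_quality(observed_roles, marker_roles):
    """Return (completeness, contamination) as percentages.

    Simplified assumption: every marker role should occur exactly once.
    A missing marker lowers completeness; each extra copy of a marker
    raises contamination.
    """
    markers = set(marker_roles)
    if not markers:
        raise ValueError("marker_roles must not be empty")

    # Count only the annotated roles that belong to the marker set.
    counts = Counter(role for role in observed_roles if role in markers)

    found = sum(1 for role in markers if counts[role] >= 1)
    extra = sum(counts[role] - 1 for role in markers if counts[role] > 1)

    total = len(markers)
    return 100.0 * found / total, 100.0 * extra / total


# Hypothetical example: four marker roles, one missing, one duplicated.
markers = {"roleA", "roleB", "roleC", "roleD"}
annotated = ["roleA", "roleB", "roleB", "roleC", "roleX"]
completeness, contamination = estimate_quality(annotated, markers)
```

In this toy input, three of the four marker roles are present and one of them appears twice, so the duplicate counts toward contamination and the missing role lowers completeness. Real tools choose marker sets per taxonomic group and may weight or group markers; this sketch does neither.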
- Authors:
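- Parrello, Bruce; Butler, Rory; Chlenski, Philippe; Olson, Robert; Overbeek, Jamie C.; Pusch, Gordon D.; Vonstein, Veronika; Overbeek, Ross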
- Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States); Univ. of Chicago, Chicago, IL (United States)
- Argonne National Lab. (ANL), Lemont, IL (United States)
- Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States)
- Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States); Argonne National Lab. (ANL), Lemont, IL (United States)
- Publication Date:
- Research Org.:
- Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC); National Institutes of Health (NIH) - National Institute of Allergy and Infectious Diseases (NIAID)
- OSTI Identifier:
- 1579345
- Grant/Contract Number:
- AC02-06CH11357; HHSN272201400027C
- Resource Type:
- Accepted Manuscript
- Journal Name:
- BMC Bioinformatics
- Additional Journal Information:
- Journal Volume: 20; Journal Issue: 1; Journal ID: ISSN 1471-2105
- Publisher:
- BioMed Central
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; CheckM; RAST; genome annotation; genome quality; machine learning; metagenomics; random forest; supervised learning
Citation Formats
Parrello, Bruce, Butler, Rory, Chlenski, Philippe, Olson, Robert, Overbeek, Jamie C., Pusch, Gordon D., Vonstein, Veronika, and Overbeek, Ross. A machine learning-based service for estimating quality of genomes using PATRIC. United States: N. p., 2019.
Web. doi:10.1186/s12859-019-3068-y.
Parrello, Bruce, Butler, Rory, Chlenski, Philippe, Olson, Robert, Overbeek, Jamie C., Pusch, Gordon D., Vonstein, Veronika, and Overbeek, Ross. 2019.
"A machine learning-based service for estimating quality of genomes using PATRIC". United States. doi:10.1186/s12859-019-3068-y. https://www.osti.gov/servlets/purl/1579345.
@article{osti_1579345,
title = {A machine learning-based service for estimating quality of genomes using PATRIC},
author = {Parrello, Bruce and Butler, Rory and Chlenski, Philippe and Olson, Robert and Overbeek, Jamie C. and Pusch, Gordon D. and Vonstein, Veronika and Overbeek, Ross},
abstractNote = {Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel. We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies. EvalG and EvalCon facilitate the rapid quality control and exploration of PATRIC-annotated draft genomes.},
doi = {10.1186/s12859-019-3068-y},
journal = {BMC Bioinformatics},
number = 1,
volume = 20,
place = {United States},
year = {2019},
month = {10}
}
Works referenced in this record:
Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center
journal, November 2016
- Wattam, Alice R.; Davis, James J.; Assaf, Rida
- Nucleic Acids Research, Vol. 45, Issue D1
PATRIC: The VBI PathoSystems Resource Integration Center
journal, January 2007
- Snyder, E. E.; Kampanya, N.; Lu, J.
- Nucleic Acids Research, Vol. 35, Issue Database
Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle
journal, January 2019
- Pasolli, Edoardo; Asnicar, Francesco; Manara, Serena
- Cell, Vol. 176, Issue 3
Anvi’o: an advanced analysis and visualization platform for ‘omics data
journal, January 2015
- Eren, A. Murat; Esen, Özcan C.; Quince, Christopher
- PeerJ, Vol. 3
BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs
journal, June 2015
- Simão, Felipe A.; Waterhouse, Robert M.; Ioannidis, Panagiotis
- Bioinformatics, Vol. 31, Issue 19
CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes
journal, May 2015
- Parks, Donovan H.; Imelfort, Michael; Skennerton, Connor T.
- Genome Research, Vol. 25, Issue 7
MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities
journal, January 2015
- Kang, Dongwan D.; Froula, Jeff; Egan, Rob
- PeerJ, Vol. 3
RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes
journal, February 2015
- Brettin, Thomas; Davis, James J.; Disz, Terry
- Scientific Reports, Vol. 5, Issue 1
The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes
journal, September 2005
- Overbeek, R.
- Nucleic Acids Research, Vol. 33, Issue 17
The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)
journal, November 2013
- Overbeek, Ross; Olson, Robert; Pusch, Gordon D.
- Nucleic Acids Research, Vol. 42, Issue D1
Works referencing / citing this record:
The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities
journal, October 2019
- Davis, James J.; Wattam, Alice R.; Aziz, Ramy K.
- Nucleic Acids Research