DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A machine learning-based service for estimating quality of genomes using PATRIC

Abstract

Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel. We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies. EvalG and EvalCon facilitate the rapid quality control and exploration of PATRIC-annotated draft genomes.

Authors:
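Parrello, Bruce [1]; Butler, Rory [2]; Chlenski, Philippe [3]; Olson, Robert [2]; Overbeek, Jamie C. [4]; Pusch, Gordon D. [3]; Vonstein, Veronika [3]; Overbeek, Ross [1]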
  1. Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States); Univ. of Chicago, Chicago, IL (United States)
  2. Argonne National Lab. (ANL), Lemont, IL (United States)
  3. Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States)
  4. Fellowship for Interpretation of Genomes, Burr Ridge, IL (United States); Argonne National Lab. (ANL), Lemont, IL (United States)
Publication Date:
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC); National Institutes of Health (NIH) - National Institute of Allergy and Infectious Diseases (NIAID)
OSTI Identifier:
1579345
Grant/Contract Number:  
AC02-06CH11357; HHSN272201400027C
Resource Type:
Accepted Manuscript
Journal Name:
BMC Bioinformatics
Additional Journal Information:
Journal Volume: 20; Journal Issue: 1; Journal ID: ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; CheckM; RAST; genome annotation; genome quality; machine learning; metagenomics; random forest; supervised learning

Citation Formats

Parrello, Bruce, Butler, Rory, Chlenski, Philippe, Olson, Robert, Overbeek, Jamie C., Pusch, Gordon D., Vonstein, Veronika, and Overbeek, Ross. A machine learning-based service for estimating quality of genomes using PATRIC. United States: N. p., 2019. Web. doi:10.1186/s12859-019-3068-y.
Parrello, Bruce, Butler, Rory, Chlenski, Philippe, Olson, Robert, Overbeek, Jamie C., Pusch, Gordon D., Vonstein, Veronika, & Overbeek, Ross. A machine learning-based service for estimating quality of genomes using PATRIC. United States. doi:10.1186/s12859-019-3068-y.
Parrello, Bruce, Butler, Rory, Chlenski, Philippe, Olson, Robert, Overbeek, Jamie C., Pusch, Gordon D., Vonstein, Veronika, and Overbeek, Ross. 2019. "A machine learning-based service for estimating quality of genomes using PATRIC". United States. doi:10.1186/s12859-019-3068-y. https://www.osti.gov/servlets/purl/1579345.
@article{osti_1579345,
title = {A machine learning-based service for estimating quality of genomes using PATRIC},
author = {Parrello, Bruce and Butler, Rory and Chlenski, Philippe and Olson, Robert and Overbeek, Jamie C. and Pusch, Gordon D. and Vonstein, Veronika and Overbeek, Ross},
abstractNote = {Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel. We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies. EvalG and EvalCon facilitate the rapid quality control and exploration of PATRIC-annotated draft genomes.},
doi = {10.1186/s12859-019-3068-y},
journal = {BMC Bioinformatics},
number = 1,
volume = 20,
place = {United States},
year = {2019},
month = {10}
}
