DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Quality scores for 32,000 genomes

Abstract

More than 80% of the microbial genomes in GenBank are of ‘draft’ quality (12,553 draft vs. 2,679 finished, as of October, 2013). In this study, we have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences. Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes. The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Finally and unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an ‘A’ (codons ending with a ‘U’) are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.
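
Two points in the abstract can be illustrated with a short Python sketch: the correspondence between an anticodon beginning with ‘A’ and a codon ending with ‘U’, and the use of a score threshold (0.8 in the abstract) to screen genomes. This is only an illustration under stated assumptions. The function names and the sample scores are hypothetical and are not taken from the paper, the pairing rule is plain Watson-Crick complementarity that ignores wobble pairing and base modifications, and the sketch does not reproduce how the authors compute the quality score from its four categories.

```python
from __future__ import annotations

# Watson-Crick complements for RNA bases (wobble pairing is ignored here).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def anticodon_for(codon: str) -> str:
    """Return the Watson-Crick anticodon (5'->3') for an RNA codon (5'->3').

    The anticodon is the reverse complement of the codon, so the first
    anticodon base pairs with the third codon base.
    """
    return "".join(COMPLEMENT[base] for base in reversed(codon.upper()))


def screen_by_score(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the IDs of genomes whose score is at or above the threshold.

    `scores` maps a genome ID to an already-computed quality score; how that
    score is derived is outside the scope of this sketch.
    """
    return sorted(gid for gid, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    # The arginine codon CGU named in the abstract pairs with anticodon ACG,
    # which begins with 'A'.
    print(anticodon_for("CGU"))

    # Hypothetical scores, for illustration only.
    example_scores = {"genome_a": 0.95, "genome_b": 0.79, "genome_c": 0.80}
    print(screen_by_score(example_scores))
```

A screening step like `screen_by_score` corresponds to the threshold-based use of the score that the abstract proposes for analyses of “all published genomes”; the cut-off is a parameter so that stricter or looser thresholds can be tried.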

Authors:
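Land, Miriam L. [1]; Hyatt, Doug [2]; Jun, Se-Ran [1]; Kora, Guruprasad H. [3]; Hauser, Loren J. [4]; Lukjancenko, Oksana [5]; Ussery, David W. [6]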
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Biosciences Division. Comparative Genomics Group
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Biosciences Division. Comparative Genomics Group; Univ. of Tennessee, Knoxville, TN (United States). Joint Inst. for Biological Sciences
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Computer Science and Mathematics Division
  4. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Biosciences Division. Comparative Genomics Group; Univ. of Tennessee, Knoxville, TN (United States). Joint Inst. for Biological Sciences; Univ. of Tennessee, Knoxville, TN (United States). Dept. of Microbiology
  5. Technical Univ. of Denmark, Lyngby (Denmark). Center for Genomic Epidemiology
  6. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Biosciences Division. Comparative Genomics Group; Univ. of Tennessee, Knoxville, TN (United States). Joint Inst. for Biological Sciences; Technical Univ. of Denmark, Lyngby (Denmark). Dept. of Systems Biology. Center for Biological Sequence Analysis
Publication Date:
2014-12-08
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Laboratory Directed Research and Development (LDRD)
OSTI Identifier:
1185423
Grant/Contract Number:  
AC05-00OR22725; PS02-06ER64304
Resource Type:
Accepted Manuscript
Journal Name:
Standards in Genomic Sciences
Additional Journal Information:
Journal Volume: 9; Journal ID: ISSN 1944-3277
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; DNA; sequencing; database; quality; evaluation; status

Citation Formats

Land, Miriam L., Hyatt, Doug, Jun, Se-Ran, Kora, Guruprasad H., Hauser, Loren J., Lukjancenko, Oksana, and Ussery, David W. Quality scores for 32,000 genomes. United States: N. p., 2014. Web. doi:10.1186/1944-3277-9-20.
Land, Miriam L., Hyatt, Doug, Jun, Se-Ran, Kora, Guruprasad H., Hauser, Loren J., Lukjancenko, Oksana, & Ussery, David W. Quality scores for 32,000 genomes. United States. https://doi.org/10.1186/1944-3277-9-20
Land, Miriam L., Hyatt, Doug, Jun, Se-Ran, Kora, Guruprasad H., Hauser, Loren J., Lukjancenko, Oksana, and Ussery, David W. 2014. "Quality scores for 32,000 genomes". United States. https://doi.org/10.1186/1944-3277-9-20. https://www.osti.gov/servlets/purl/1185423.
@article{osti_1185423,
title = {Quality scores for 32,000 genomes},
author = {Land, Miriam L. and Hyatt, Doug and Jun, Se-Ran and Kora, Guruprasad H. and Hauser, Loren J. and Lukjancenko, Oksana and Ussery, David W.},
abstractNote = {More than 80% of the microbial genomes in GenBank are of ‘draft’ quality (12,553 draft vs. 2,679 finished, as of October, 2013). In this study, we have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences. Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes. The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Finally and unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an ‘A’ (codons ending with a ‘U’) are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.},
doi = {10.1186/1944-3277-9-20},
journal = {Standards in Genomic Sciences},
volume = 9,
place = {United States},
year = {2014},
month = {12}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 30 works
Citation information provided by
Web of Science

Works referenced in this record:

A Semantic Web Management Model for Integrative Biomedical Informatics
journal, August 2008


TOLKIN – Tree of Life Knowledge and Information Network: Filling a Gap for Collaborative Research in Biological Systematics
journal, June 2012


Recent Directions in Compressing Next Generation Sequencing Data
journal, March 2012

  • Bhattacharyya, Malay; Bhattacharyya, Manas; Bandyopadhyay, Sanghamitra
  • Current Bioinformatics, Vol. 7, Issue 1
  • DOI: 10.2174/157489312799304422

The Fast Changing Landscape of Sequencing Technologies and Their Impact on Microbial Genome Assemblies and Annotation
journal, December 2012


The Value of Complete Microbial Genome Sequencing (You Get What You Pay For)
journal, December 2002


Genome Project Standards in a New Era of Sequencing
journal, October 2009


GenBank
journal, November 2012

  • Benson, Dennis A.; Cavanaugh, Mark; Clark, Karen
  • Nucleic Acids Research, Vol. 41, Issue D1
  • DOI: 10.1093/nar/gks1195

Multilocus Sequence Typing of Total-Genome-Sequenced Bacteria
journal, January 2012

  • Larsen, M. V.; Cosentino, S.; Rasmussen, S.
  • Journal of Clinical Microbiology, Vol. 50, Issue 4
  • DOI: 10.1128/JCM.06094-11

Pfam: the protein families database
journal, November 2013

  • Finn, Robert D.; Bateman, Alex; Clements, Jody
  • Nucleic Acids Research, Vol. 42, Issue D1
  • DOI: 10.1093/nar/gkt1223

tRNAscan-SE: A Program for Improved Detection of Transfer RNA Genes in Genomic Sequence
journal, March 1997


RNAmmer: consistent and rapid annotation of ribosomal RNA genes
journal, April 2007

  • Lagesen, Karin; Hallin, Peter; Rødland, Einar Andreas
  • Nucleic Acids Research, Vol. 35, Issue 9
  • DOI: 10.1093/nar/gkm160

Prodigal: prokaryotic gene recognition and translation initiation site identification
journal, March 2010


HMMER web server: interactive sequence similarity searching
journal, May 2011

  • Finn, R. D.; Clements, J.; Eddy, S. R.
  • Nucleic Acids Research, Vol. 39, Issue suppl
  • DOI: 10.1093/nar/gkr367

GtRNAdb: a database of transfer RNA genes detected in genomic sequence
journal, January 2009

  • Chan, P. P.; Lowe, T. M.
  • Nucleic Acids Research, Vol. 37, Issue Database
  • DOI: 10.1093/nar/gkn787

The advantages of SMRT sequencing
journal, June 2013

  • Roberts, Richard J.; Carneiro, Mauricio O.; Schatz, Michael C.
  • Genome Biology, Vol. 14, Issue 6
  • DOI: 10.1186/gb-2013-14-6-405

Spatiotemporal persistence of multiple, diverse clades and toxins of Corynebacterium diphtheriae
journal, March 2021

  • Will, Robert C.; Ramamurthy, Thandavarayan; Sharma, Naresh Chand
  • Nature Communications, Vol. 12, Issue 1
  • DOI: 10.1038/s41467-021-21870-5

The advantages of SMRT sequencing
journal, July 2013

  • Roberts, Richard J.; Carneiro, Mauricio O.; Schatz, Michael C.
  • Genome Biology, Vol. 14, Issue 7
  • DOI: 10.1186/gb-2013-14-7-405

The DNA data deluge
journal, July 2013


PATRIC: the Comprehensive Bacterial Bioinformatics Resource with a Focus on Human Pathogenic Species
journal, September 2011

  • Gillespie, Joseph J.; Wattam, Alice R.; Cammer, Stephen A.
  • Infection and Immunity, Vol. 79, Issue 11
  • DOI: 10.1128/iai.00207-11

Genome Sequence of Thermofilum pendens Reveals an Exceptional Loss of Biosynthetic Pathways without Genome Reduction
journal, February 2008

  • Anderson, I.; Rodriguez, J.; Susanti, D.
  • Journal of Bacteriology, Vol. 190, Issue 8
  • DOI: 10.1128/jb.01949-07

Works referencing / citing this record:

Bioprospecting Archaea: Focus on Extreme Halophiles
book, December 2016


Insights from 20 years of bacterial genome sequencing
journal, February 2015

  • Land, Miriam; Hauser, Loren; Jun, Se-Ran
  • Functional & Integrative Genomics, Vol. 15, Issue 2
  • DOI: 10.1007/s10142-015-0433-4

FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science
journal, July 2019


Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea
journal, December 2019


Microbiome analyses of blood and tissues suggest cancer diagnostic approach
journal, March 2020


Population genomic datasets describing the post-vaccine evolutionary epidemiology of Streptococcus pneumoniae
journal, October 2015

  • Croucher, Nicholas J.; Finkelstein, Jonathan A.; Pelton, Stephen I.
  • Scientific Data, Vol. 2, Issue 1
  • DOI: 10.1038/sdata.2015.58

Genomic characterization of Nontuberculous Mycobacteria
journal, March 2017

  • Fedrizzi, Tarcisio; Meehan, Conor J.; Grottola, Antonella
  • Scientific Reports, Vol. 7, Issue 1
  • DOI: 10.1038/srep45258

Genome Evolution of Bartonellaceae Symbionts of Ants at the Opposite Ends of the Trophic Scale
journal, July 2018

  • Bisch, Gaelle; Neuvonen, Minna-Maria; Pierce, Naomi E.
  • Genome Biology and Evolution, Vol. 10, Issue 7
  • DOI: 10.1093/gbe/evy126

What can we learn from over 100,000 Escherichia coli genomes?
journal, January 2020

  • Abram, Kaleb; Udaondo, Zulema; Bleker, Carissa
  • Communications Biology
  • DOI: 10.1101/708131

Assessment of genome annotation using gene function similarity within the gene neighborhood
journal, July 2017


Pan4Draft: A Computational Tool to Improve the Accuracy of Pan-Genomic Analysis Using Draft Genomes
journal, June 2018


The landscape of microbial phenotypic traits and associated genes
journal, October 2016

  • Brbić, Maria; Piškorec, Matija; Vidulin, Vedrana
  • Nucleic Acids Research
  • DOI: 10.1093/nar/gkw964

dBBQs : dataBase of Bacterial Quality scores
journal, September 2017

  • Wanchai, Visanu; Patumcharoenpol, Preecha; Nookaew, Intawat
  • BMC Bioinformatics
  • DOI: 10.1101/187641

Analysis of Draft Genome Sequence of Pseudomonas sp. QTF5 Reveals Its Benzoic Acid Degradation Ability and Heavy Metal Tolerance
journal, January 2017


Molecular tools in understanding the evolution of Vibrio cholerae
journal, October 2015

  • Rahaman, Md. Habibur; Islam, Tarequl; Colwell, Rita R.
  • Frontiers in Microbiology, Vol. 6
  • DOI: 10.3389/fmicb.2015.01040

Arcobacter cryaerophilus Isolated From New Zealand Mussels Harbor a Putative Virulence Plasmid
journal, August 2019

  • On, Stephen L. W.; Althaus, Damien; Miller, William G.
  • Frontiers in Microbiology, Vol. 10
  • DOI: 10.3389/fmicb.2019.01802

Quality Assessment of Domesticated Animal Genome Assemblies
journal, January 2015

  • Seemann, Stefan E.; Anthon, Christian; Palasca, Oana
  • Bioinformatics and Biology Insights, Vol. 9S4
  • DOI: 10.4137/bbi.s29333