DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Rapid phylogenetic and functional classification of short genomic fragments with signature peptides

Abstract

Background: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers.
Results: At even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly-sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicate a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database.
Conclusions: Classification by exact matching against a precomputed list of signature peptides provides comparable results to existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions.
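
The classification step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy child-to-parent tree, and the rule for conflicting hits (fall back to the lowest common ancestor) are assumptions made for illustration only. The abstract states only that a fragment is placed at the most specific node consistent with the signature peptides it contains.

```python
from itertools import product

# Standard genetic code, built compactly in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
K = 10  # peptide signature length used in the abstract


def translate(dna):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(read):
    """Yield the translation of all six reading frames of a DNA read."""
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]
    for strand in (read, reverse):
        for offset in range(3):
            yield translate(strand[offset:])


def lineage(node, parent):
    """Root-to-node path in a tree given as a child -> parent mapping."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


def classify(read, signatures, parent):
    """Return a tree node for the read, or None if no signature matches.

    signatures: dict mapping peptide 10-mer -> tree node (hypothetical layout).
    parent:     dict mapping child node -> parent node (hypothetical layout).
    """
    hits = set()
    for peptide in six_frame_peptides(read):
        for i in range(len(peptide) - K + 1):
            node = signatures.get(peptide[i:i + K])  # exact match only
            if node is not None:
                hits.add(node)
    if not hits:
        return None
    paths = [lineage(node, parent) for node in hits]
    deepest = max(paths, key=len)
    # All hits on one lineage: report the most specific (deepest) node.
    if all(deepest[:len(p)] == p for p in paths):
        return deepest[-1]
    # Assumption: conflicting hits fall back to the lowest common ancestor.
    common = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common
```

Because every lookup is a dictionary access on a fixed-length string, the cost per read grows linearly with read length, which is in line with the throughput advantage over alignment-based search that the abstract reports.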

Authors:
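Berendzen, Joel [1]; Bruno, William J. [2]; Cohn, Judith D. [3]; Hengartner, Nicolas W. [3]; Kuske, Cheryl R. [4]; McMahon, Benjamin H. [2]; Wolinsky, Murray A. [4]; Xie, Gary [4]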
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States). Physics Division
  2. Los Alamos National Lab. (LANL), Los Alamos, NM (United States). Theoretical Div.
  3. Los Alamos National Lab. (LANL), Los Alamos, NM (United States). Computer, Computational, and Statistical Sciences Division
  4. Los Alamos National Lab. (LANL), Los Alamos, NM (United States). Bioscience Division
Publication Date:
2012-08-28
Research Org.:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER). Biological Systems Science Division; USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1629287
Grant/Contract Number:  
W-7405-ENG-36
Resource Type:
Accepted Manuscript
Journal Name:
BMC Research Notes
Additional Journal Information:
Journal Volume: 5; Journal Issue: 1; Journal ID: ISSN 1756-0500
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; signature peptide; reference genome; reference database; phylogenetic distance; phylogenetic profile

Citation Formats

Berendzen, Joel, Bruno, William J., Cohn, Judith D., Hengartner, Nicolas W., Kuske, Cheryl R., McMahon, Benjamin H., Wolinsky, Murray A., and Xie, Gary. Rapid phylogenetic and functional classification of short genomic fragments with signature peptides. United States: N. p., 2012. Web. doi:10.1186/1756-0500-5-460.
Berendzen, Joel, Bruno, William J., Cohn, Judith D., Hengartner, Nicolas W., Kuske, Cheryl R., McMahon, Benjamin H., Wolinsky, Murray A., & Xie, Gary. (2012). Rapid phylogenetic and functional classification of short genomic fragments with signature peptides. United States. https://doi.org/10.1186/1756-0500-5-460
Berendzen, Joel, Bruno, William J., Cohn, Judith D., Hengartner, Nicolas W., Kuske, Cheryl R., McMahon, Benjamin H., Wolinsky, Murray A., and Xie, Gary. 2012. "Rapid phylogenetic and functional classification of short genomic fragments with signature peptides". United States. https://doi.org/10.1186/1756-0500-5-460. https://www.osti.gov/servlets/purl/1629287.
@article{osti_1629287,
title = {Rapid phylogenetic and functional classification of short genomic fragments with signature peptides},
author = {Berendzen, Joel and Bruno, William J. and Cohn, Judith D. and Hengartner, Nicolas W. and Kuske, Cheryl R. and McMahon, Benjamin H. and Wolinsky, Murray A. and Xie, Gary},
abstractNote = {Background: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers. Results: At even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly-sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicates a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database. 
Conclusions: Classification by exact matching against a precomputed list of signature peptides provides comparable results to existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions.},
doi = {10.1186/1756-0500-5-460},
journal = {BMC Research Notes},
number = 1,
volume = 5,
place = {United States},
year = {2012},
month = aug
}

Works referenced in this record:

A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea
journal, December 2009

  • Wu, Dongying; Hugenholtz, Philip; Mavromatis, Konstantinos
  • Nature, Vol. 462, Issue 7276
  • DOI: 10.1038/nature08656

A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity
journal, February 2012


SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences
journal, May 2009


Uprooting the Tree of Life
journal, February 2000


Polymerase chain reaction primers miss half of rRNA microbial diversity
journal, August 2009

  • Hong, SunHee; Bunge, John; Leslin, Chesley
  • The ISME Journal, Vol. 3, Issue 12
  • DOI: 10.1038/ismej.2009.89

Evidence for a Gram-positive, Eubacterial Root of the Tree of Life
journal, April 2007

  • Skophammer, R. G.; Servin, J. A.; Herbold, C. W.
  • Molecular Biology and Evolution, Vol. 24, Issue 8
  • DOI: 10.1093/molbev/msm096

Environmental distribution of prokaryotic taxa
journal, January 2010

  • Tamames, Javier; Abellan, Juan Jose; Pignatelli, Miguel
  • BMC Microbiology, Vol. 10, Issue 1
  • DOI: 10.1186/1471-2180-10-85

BLAT – The BLAST-Like Alignment Tool
journal, March 2002


MetaSim—A Sequencing Simulator for Genomics and Metagenomics
journal, October 2008


MEGAN analysis of metagenomic data
journal, February 2007

  • Huson, D. H.; Auch, A. F.; Qi, J.
  • Genome Research, Vol. 17, Issue 3
  • DOI: 10.1101/gr.5969107

WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads
journal, December 2009

  • Gerlach, Wolfgang; Jünemann, Sebastian; Tille, Felix
  • BMC Bioinformatics, Vol. 10, Issue 1
  • DOI: 10.1186/1471-2105-10-430

High Frequency of Horizontal Gene Transfer in the Oceans
journal, September 2010


3D domain swapping: A mechanism for oligomer assembly
journal, December 1995

  • Bennett, Melanie J.; Schlunegger, Michael P.; Eisenberg, David
  • Protein Science, Vol. 4, Issue 12
  • DOI: 10.1002/pro.5560041202

The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation
journal, January 2007

  • McNeil, L. K.; Reich, C.; Aziz, R. K.
  • Nucleic Acids Research, Vol. 35, Issue Database
  • DOI: 10.1093/nar/gkl947

Improving the specificity of high-throughput ortholog prediction
text, January 2006

  • Fulton, Debra L.; Li, Yvonne Y.; Laird, Matthew R.
  • BioMed Central
  • DOI: 10.14288/1.0215896

Systematic artifacts in metagenomes from complex microbial communities
journal, July 2009

  • Gomez-Alvarez, Vicente; Teal, Tracy K.; Schmidt, Thomas M.
  • The ISME Journal, Vol. 3, Issue 11
  • DOI: 10.1038/ismej.2009.72

Drivers of bacterial β-diversity depend on spatial scale
journal, April 2011

  • Martiny, J. B. H.; Eisen, J. A.; Penn, K.
  • Proceedings of the National Academy of Sciences, Vol. 108, Issue 19
  • DOI: 10.1073/pnas.1016308108

The Pfam protein families database
journal, January 2004

  • Bateman, Alex; Coin, Lachlan; Durbin, Richard
  • Nucleic Acids Research, Vol. 32, Issue S1, p. D138-D141
  • DOI: 10.1093/nar/gkh121

Harnessing the power of the human microbiome
journal, April 2010


Local homology recognition and distance measures in linear time using compressed amino acid alphabets
journal, January 2004


Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs)
journal, May 2011

  • Sims, G. E.; Kim, S. -H.
  • Proceedings of the National Academy of Sciences, Vol. 108, Issue 20
  • DOI: 10.1073/pnas.1105168108

Search and clustering orders of magnitude faster than BLAST
journal, August 2010


The Phylogenetic Diversity of Metagenomes
journal, August 2011


Orthologous Transcription Factors in Bacteria Have Different Functions and Regulate Different Genes
journal, September 2007


Metagenome Fragment Classification Using N-Mer Frequency Profiles
journal, January 2008

  • Rosen, Gail; Garbarine, Elaine; Caseiro, Diamantino
  • Advances in Bioinformatics, Vol. 2008
  • DOI: 10.1155/2008/205969

Weighted Neighbor Joining: A Likelihood-Based Approach to Distance-Based Phylogeny Reconstruction
journal, January 2000


Modeling residue usage in aligned protein sequences via maximum likelihood
journal, December 1996


The metagenomics of soil
journal, June 2005


Artificial and natural duplicates in pyrosequencing reads of metagenomic data
journal, January 2010


FIGfams: yet another set of protein families
journal, September 2009

  • Meyer, Folker; Overbeek, Ross; Rodriguez, Alex
  • Nucleic Acids Research, Vol. 37, Issue 20
  • DOI: 10.1093/nar/gkp698

A simple, fast, and accurate method of phylogenomic inference
journal, January 2008


The Subsystems Approach to Genome Annotation and its Use in the Project to Annotate 1000 Genomes
journal, September 2005


The Ribosomal Database Project: improved alignments and new tools for rRNA analysis
journal, January 2009

  • Cole, J. R.; Wang, Q.; Cardenas, E.
  • Nucleic Acids Research, Vol. 37, Issue Database
  • DOI: 10.1093/nar/gkn879

The PROSITE database
journal, January 2006


EMBOSS: The European Molecular Biology Open Software Suite
journal, June 2000


MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies
journal, January 2010


Metagenomic Sequencing of an In Vitro-Simulated Microbial Community
journal, April 2010


Basic local alignment search tool
journal, October 1990

  • Altschul, Stephen F.; Gish, Warren; Miller, Webb
  • Journal of Molecular Biology, Vol. 215, Issue 3, p. 403-410
  • DOI: 10.1016/S0022-2836(05)80360-2

Codon-Substitution Models for Heterogeneous Selection Pressure at Amino Acid Sites
journal, May 2000


Structural and functional constraints in the evolution of protein families
journal, September 2009

  • Worth, Catherine L.; Gong, Sungsam; Blundell, Tom L.
  • Nature Reviews Molecular Cell Biology, Vol. 10, Issue 10
  • DOI: 10.1038/nrm2762

DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences
journal, October 2010

  • Ghosh, Tarini Shankar; Haque M., Monzoorul; Mande, Sharmila S.
  • BMC Bioinformatics, Vol. 11, Issue S7
  • DOI: 10.1186/1471-2105-11-S7-S14

The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific
journal, March 2007


The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes
journal, September 2008


Deriving enzymatic and taxonomic signatures of metagenomes from short read data
journal, July 2010


Predicting conserved protein motifs with Sub-HMMs
journal, April 2010


MUSCLE: multiple sequence alignment with high accuracy and high throughput
journal, March 2004

  • Edgar, R. C.
  • Nucleic Acids Research, Vol. 32, Issue 5, p. 1792-1797
  • DOI: 10.1093/nar/gkh340

Crotonobetaine reductase from Escherichia coli – a new inducible enzyme of anaerobic metabolization of L(-)-carnitine
journal, January 1994

  • Roth, Sylke; Jung, Kirsten; Jung, Heinrich
  • Antonie van Leeuwenhoek, Vol. 65, Issue 1
  • DOI: 10.1007/BF00878280

Improving the specificity of high-throughput ortholog prediction
journal, May 2006

  • Fulton, Debra L.; Li, Yvonne Y.; Laird, Matthew R.
  • BMC Bioinformatics, Vol. 7, Issue 1
  • DOI: 10.1186/1471-2105-7-270

Metagenomics: Read Length Matters
journal, January 2008

  • Wommack, K. E.; Bhavsar, J.; Ravel, J.
  • Applied and Environmental Microbiology, Vol. 74, Issue 5
  • DOI: 10.1128/AEM.02181-07

Evolution by Gene Duplication
book, January 1970


Rubellimicrobium mesophilum sp. nov., a mesophilic, pigmented bacterium isolated from soil
journal, August 2008

  • Dastager, S. G.; Lee, J. -C.; Ju, Y. -J.
  • International Journal of Systematic and Evolutionary Microbiology, Vol. 58, Issue 8
  • DOI: 10.1099/ijs.0.65590-0

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data
journal, October 2011


Biopython: freely available Python tools for computational molecular biology and bioinformatics
journal, March 2009


Works referencing / citing this record:

High-Specificity Targeted Functional Profiling in Microbial Communities with ShortBRED
journal, December 2015


Environmental genes and genomes: understanding the differences and challenges in the approaches and software for their analyses
journal, February 2015

  • Zepeda Mendoza, Marie Lisandra; Sicheritz-Pontén, Thomas; Gilbert, M. Thomas P.
  • Briefings in Bioinformatics, Vol. 16, Issue 5
  • DOI: 10.1093/bib/bbv001

SUPER-FOCUS: a tool for agile functional analysis of shotgun metagenomic data
journal, October 2015


Accurate read-based metagenome characterization using a hierarchical suite of unique signatures
journal, March 2015

  • Freitas, Tracey Allen K.; Li, Po-E; Scholz, Matthew B.
  • Nucleic Acids Research, Vol. 43, Issue 10
  • DOI: 10.1093/nar/gkv180

Rapid sequence identification of potential pathogens using techniques from sparse linear algebra
conference, April 2015

  • Dodson, Stephanie; Ricke, Darrell O.; Kepner, Jeremy
  • 2015 IEEE International Symposium on Technologies for Homeland Security (HST)
  • DOI: 10.1109/ths.2015.7225316

California condor microbiomes: Bacterial variety and functional properties in captive-bred individuals
journal, December 2019


From cultured to uncultured genome sequences: metagenomics and modeling microbial ecosystems
journal, August 2015


Scalable metagenomic taxonomy classification using a reference genome database
journal, July 2013


Validation of high throughput sequencing and microbial forensics applications
journal, January 2014

  • Budowle, Bruce; Connell, Nancy D.; Bielecka-Oder, Anna
  • Investigative Genetics, Vol. 5, Issue 1
  • DOI: 10.1186/2041-2223-5-9

Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities
journal, November 2015

