Title: Binning sequences using very sparse labels within a metagenome

Journal Article · BMC Bioinformatics
 [1];  [1];  [1];  [2]
  1. Univ. of Melbourne (Australia). Dept. of Mechanical Engineering. Dynamic Systems and Control Group
  2. Academia Sinica, Taipei (Taiwan). Research Centre for Biodiversity

Background: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most current binning methods, such as BLAST, k-mer and PhyloPythia, assign sequence fragments by comparing their sequence similarity or sequence composition with already-sequenced genomes, which are still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity.
Results: The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM) and is called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings: S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparably in the ≥ 8 kb benchmark tests.
Conclusion: In the task of binning using semi-supervised learning methods, the results indicate that S-GSOM is the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well with the current most advanced binning method, PhyloPythia.
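The idea of assigning fragments to seeds by compositional similarity can be illustrated with a minimal Python sketch. This is not S-GSOM: it replaces the self-organising map with a plain nearest-seed rule, and the choice of tetranucleotide (k = 4) frequencies, the Euclidean distance, the function names, and the `max_distance` cut-off (a rough stand-in for a confidence threshold) are all assumptions made for illustration, not details taken from the abstract.

```python
from itertools import product
import math

K = 4  # tetranucleotides; an assumed choice, not stated in the abstract
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def composition(seq):
    """Return the normalised k-mer frequency vector of a DNA sequence."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])
        if idx is not None:  # skip windows containing N or other symbols
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def distance(u, v):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign_to_seeds(fragments, seeds, max_distance=None):
    """Label each fragment with its compositionally nearest seed.

    fragments: dict mapping fragment name -> sequence
    seeds: dict mapping seed label -> seed sequence
    max_distance: hypothetical cut-off; fragments farther than this from
        every seed are left unassigned (None).
    """
    seed_vectors = {label: composition(s) for label, s in seeds.items()}
    labels = {}
    for name, seq in fragments.items():
        vec = composition(seq)
        best_label, best_dist = min(
            ((label, distance(vec, sv)) for label, sv in seed_vectors.items()),
            key=lambda pair: pair[1],
        )
        if max_distance is not None and best_dist > max_distance:
            labels[name] = None
        else:
            labels[name] = best_label
    return labels
```

In this sketch the seeds would be the sequences flanking 16S rRNA genes found in the metagenome, one per species, and every other fragment inherits the label of the seed whose composition it most resembles. A map-based method such as the one described in the abstract additionally exploits the cluster structure of the unlabelled fragments, which this nearest-seed rule ignores.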

Research Organization:
USDOE Joint Genome Institute (JGI), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
OSTI ID:
1626354
Journal Information:
BMC Bioinformatics, Vol. 9, Issue 1; ISSN 1471-2105
Publisher:
BioMed Central
Country of Publication:
United States
Language:
English
