DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Deconvolute individual genomes from metagenome sequences through short read clustering

Abstract

Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false negative (under-clustering) or false positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.

Authors:
 Li, Kexue [1]; Lu, Yakang [1]; Deng, Li [2]; Wang, Lili [1]; Shi, Lizhen [3]; Wang, Zhong [4]
  1. Shanghai Univ., Shanghai (China). School of Mechanics Engineering and Automation; Shanghai Key Laboratory of Power Station Automation Technology, Shanghai (China)
  2. Shanghai Univ., Shanghai (China). School of Mechanics Engineering and Automation; Shanghai Key Laboratory of Power Station Automation Technology, Shanghai (China); USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States)
  3. Florida State Univ., Tallahassee, FL (United States)
  4. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Merced, CA (United States). School of Natural Sciences
Publication Date:
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Biological and Environmental Research (BER); National Natural Science Foundation of China (NSFC)
OSTI Identifier:
1631617
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
PeerJ
Additional Journal Information:
Journal Volume: 8; Journal Issue: 4; Journal ID: ISSN 2167-8359
Publisher:
PeerJ Inc.
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; Metagenome clustering; Short-read clustering; Apache Spark

Citation Formats

Li, Kexue, Lu, Yakang, Deng, Li, Wang, Lili, Shi, Lizhen, and Wang, Zhong. Deconvolute individual genomes from metagenome sequences through short read clustering. United States: N. p., 2020. Web. doi:10.7717/peerj.8966.
Li, Kexue, Lu, Yakang, Deng, Li, Wang, Lili, Shi, Lizhen, & Wang, Zhong. Deconvolute individual genomes from metagenome sequences through short read clustering. United States. https://doi.org/10.7717/peerj.8966
Li, Kexue, Lu, Yakang, Deng, Li, Wang, Lili, Shi, Lizhen, and Wang, Zhong. 2020. "Deconvolute individual genomes from metagenome sequences through short read clustering". United States. https://doi.org/10.7717/peerj.8966. https://www.osti.gov/servlets/purl/1631617.
@article{osti_1631617,
title = {Deconvolute individual genomes from metagenome sequences through short read clustering},
author = {Li, Kexue and Lu, Yakang and Deng, Li and Wang, Lili and Shi, Lizhen and Wang, Zhong},
abstractNote = {Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer either false negative (under-clustering) or false positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.},
doi = {10.7717/peerj.8966},
journal = {PeerJ},
number = 4,
volume = 8,
place = {United States},
year = {2020},
month = {4}
}

Figures / Tables:

Figure 1: An overview of the clustering strategies. (A) Local clustering: short-read sequences from multiple samples of a microbial community (such as those derived from different sample sites or times, S1, S2…Sm) are combined and clustered using the scalable overlap-based clustering algorithm in SpaRC. Many small clusters are formed, and reads from the same genome scatter across many clusters (under-clustering). (B) Estimating genome coverage from unassembled read clusters. In the left illustration, two read clusters show different k-mer frequency peaks, each corresponding to the coverage of its underlying genome (dotted lines). In the right illustration, multiple read clusters derived from the same genome will in theory have the same genome coverage in a given sample, while the height of the peak (number of k-mers) can differ greatly depending on the size of the read clusters. (C) Global clustering. First, the sequencing coverage of each small cluster from the local clustering step is estimated and a cluster coverage matrix is derived. Second, a square similarity matrix is obtained by computing pairwise cosine similarities between all clusters. Finally, a graph is constructed using clusters as nodes and their similarities as weighted edges. Larger clusters containing all the reads from individual genomes can be obtained by partitioning the graph using the Label Propagation Algorithm (LPA).
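
The global clustering step in panel (C) can be illustrated with a small, self-contained Python sketch. It is not the SpaRC implementation (which runs on Apache Spark); it only mirrors the three steps named in the caption on a toy example. The coverage values, the 0.95 similarity cutoff, the tie-breaking rule, and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity of two coverage profiles (one value per sample)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(coverage, min_similarity):
    """Nodes are local clusters; an edge joins two clusters whose
    coverage profiles have cosine similarity >= min_similarity."""
    n = len(coverage)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(coverage[i], coverage[j])
            if sim >= min_similarity:
                graph[i][j] = sim
                graph[j][i] = sim
    return graph


def label_propagation(graph, max_iter=100, seed=0):
    """Asynchronous weighted label propagation; returns node -> label."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # every node starts alone
    nodes = list(graph)
    for _ in range(max_iter):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            if not graph[node]:
                continue  # isolated cluster keeps its own label
            weights = Counter()
            for neighbor, weight in graph[node].items():
                weights[labels[neighbor]] += weight
            best = max(weights.values())
            # Assumed tie-break: smallest label among the heaviest ones.
            new_label = min(l for l, w in weights.items() if w == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def merge_clusters(labels):
    """Group local cluster ids that ended up with the same label."""
    groups = {}
    for node, label in labels.items():
        groups.setdefault(label, []).append(node)
    return sorted(sorted(group) for group in groups.values())


# Toy cluster coverage matrix: rows are local clusters, columns are samples.
# Rows 0-2 mimic one genome, rows 3-4 another (made-up numbers).
coverage = [
    [10.0, 2.0, 30.0],
    [11.0, 2.0, 29.0],
    [9.0, 2.2, 31.0],
    [3.0, 40.0, 5.0],
    [3.2, 38.0, 5.5],
]

graph = build_graph(coverage, min_similarity=0.95)
merged = merge_clusters(label_propagation(graph))
print(merged)
```

In this toy run, local clusters with proportional coverage across the three samples end up in the same merged cluster, which is the intuition behind using multi-sample coverage to reduce under-clustering. A real dataset would need a sparse or distributed similarity computation rather than the all-pairs loop used here.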

Figures/Tables have been extracted from DOE-funded journal article accepted manuscripts.