SpaRC: scalable sequence clustering using Apache Spark

Journal Article · Bioinformatics
 [1];  [2];  [3];  [1];  [4];  [5]
  1. Florida State Univ., Tallahassee, FL (United States)
  2. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  3. Pacific Biosciences Inc., Menlo Park, CA (United States)
  4. USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California at Merced, CA (United States)
  5. Canada's Michael Smith Genome Sciences Centre, Vancouver, BC (Canada); Univ. of British Columbia, Vancouver, BC (Canada); Simon Fraser Univ., Burnaby, BC (Canada)

MOTIVATION: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembling these data sets requires a tradeoff between scalability and accuracy: current assembly methods optimized for scalability often sacrifice accuracy, and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

RESULTS: In this work, we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC achieves high clustering performance on transcriptomes and metagenomes from both short- and long-read sequencing technologies. It scales near-linearly with input data size and number of compute nodes. SpaRC runs in both cloud computing and HPC environments without modification and delivers similar performance in each. Our findings demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.

AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/berkeleylab/jgi-sparc.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1542383
Alternate ID(s):
OSTI ID: 1471135
Journal Information:
Bioinformatics, Vol. 35, Issue 5; ISSN 1367-4803
Publisher:
Oxford University Press
Country of Publication:
United States
Language:
English
