SpaRC: scalable sequence clustering using Apache Spark
- Florida State Univ., Tallahassee, FL (United States)
- USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Pacific Biosciences Inc., Menlo Park, CA (United States)
- USDOE Joint Genome Institute (JGI), Walnut Creek, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California at Merced, CA (United States)
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC (Canada); Univ. of British Columbia, Vancouver, BC (Canada); Simon Fraser Univ., Burnaby, BC (Canada)
MOTIVATION: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembling these data sets requires tradeoffs between scalability and accuracy: current assembly methods optimized for scalability often sacrifice accuracy, and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.
RESULTS: In this work we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It reaches near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our findings demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.
AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/berkeleylab/jgi-sparc.
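The abstract states that SpaRC partitions reads by their molecule of origin but does not spell out the method. As a rough illustration of the general idea only, the sketch below groups reads that share k-mers, using plain Python and a union-find structure. It is not SpaRC's implementation and does not use Apache Spark; the function names (`kmers`, `cluster_reads`), the parameters `k` and `min_shared`, and the use of connected components as the grouping rule are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=5, min_shared=2):
    """Group reads that share at least min_shared k-mers, transitively.

    Hypothetical helper for illustration; not part of SpaRC.
    """
    # Map each k-mer to the indices of the reads that contain it.
    kmer_to_reads = defaultdict(set)
    for idx, seq in enumerate(reads):
        for km in kmers(seq, k):
            kmer_to_reads[km].add(idx)

    # Count shared k-mers for every pair of reads seen under the same k-mer.
    shared = defaultdict(int)
    for members in kmer_to_reads.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1

    # Union-find over read indices; join pairs with enough shared k-mers.
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    # Collect read indices by their representative.
    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


if __name__ == "__main__":
    example_reads = [
        "ACGTACGTGGA",   # overlaps the next read
        "GTACGTGGATT",
        "TTTTCCCCAAAG",  # overlaps the next read
        "TCCCCAAAGGG",
    ]
    print(cluster_reads(example_reads))
```

In a distributed setting the k-mer-to-read map and the pair counting would each become a shuffle-and-aggregate step, which is the kind of workload Spark is designed for; this single-machine version only shows the logic on toy input.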
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Biological and Environmental Research (BER) (SC-23)
- Grant/Contract Number:
- AC02-05CH11231
- OSTI ID:
- 1542383
- Alternate ID(s):
- OSTI ID: 1471135
- Journal Information:
- Journal Name: Bioinformatics; Journal Volume: 35; Journal Issue: 5; ISSN 1367-4803
- Publisher:
- Oxford University Press
- Country of Publication:
- United States
- Language:
- English