
Title: SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud

Abstract

The advent of high-throughput DNA sequencing techniques has permitted very high-quality de novo assemblies of genomes, but resolving the large genome graphs generated by most overlap-graph de novo assemblers requires large amounts of computer memory. To address this limitation, we present a novel algorithmic approach: Scalable Overlap-graph Reduction Algorithms (SORA). SORA adapts string graph reduction algorithms for genome assembly using a distributed computing platform. To efficiently compute coverage for enormous paths in the graphs, SORA uses Apache Spark, a cluster-based engine designed on top of Hadoop to handle very large datasets in the cloud. The experimental results show that SORA can process a graph of nearly one billion edges in a distributed cloud cluster, as well as smaller graphs on a local cluster, with a short turnaround time. Moreover, our algorithms scale almost linearly with increasing numbers of virtual instances in the cloud. SORA is freely available for download at https://github.com/BioHPC/SORA/.

Authors:
Paul, Alexander [1]; Lawrence, Dylan [2]; Song, Myoungkyu [3]; Lim, Seung-Hwan [4]; Pan, Chongle [5]; Ahn, Tae-Hyuk [1]
  1. Saint Louis University, Missouri
  2. Washington University, St. Louis
  3. University of Nebraska, Omaha
  4. ORNL
  5. University of Oklahoma, Norman
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1557475
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 12/3/2018-12/6/2018
Country of Publication:
United States
Language:
English

Citation Formats

Paul, Alexander, Lawrence, Dylan, Song, Myoungkyu, Lim, Seung-Hwan, Pan, Chongle, and Ahn, Tae-Hyuk. SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud. United States: N. p., 2018. Web. doi:10.1109/BIBM.2018.8621546.
Paul, Alexander, Lawrence, Dylan, Song, Myoungkyu, Lim, Seung-Hwan, Pan, Chongle, & Ahn, Tae-Hyuk. SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud. United States. doi:10.1109/BIBM.2018.8621546.
Paul, Alexander, Lawrence, Dylan, Song, Myoungkyu, Lim, Seung-Hwan, Pan, Chongle, and Ahn, Tae-Hyuk. 2018. "SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud". United States. doi:10.1109/BIBM.2018.8621546. https://www.osti.gov/servlets/purl/1557475.
@inproceedings{osti_1557475,
title = {SORA: Scalable Overlap-graph Reduction Algorithms for Genome Assembly using Apache Spark in the Cloud},
author = {Paul, Alexander and Lawrence, Dylan and Song, Myoungkyu and Lim, Seung-Hwan and Pan, Chongle and Ahn, Tae-Hyuk},
abstractNote = {The advent of high-throughput DNA sequencing techniques has permitted very high-quality de novo assemblies of genomes, but resolving the large genome graphs generated by most overlap-graph de novo assemblers requires large amounts of computer memory. To address this limitation, we present a novel algorithmic approach: Scalable Overlap-graph Reduction Algorithms (SORA). SORA adapts string graph reduction algorithms for genome assembly using a distributed computing platform. To efficiently compute coverage for enormous paths in the graphs, SORA uses Apache Spark, a cluster-based engine designed on top of Hadoop to handle very large datasets in the cloud. The experimental results show that SORA can process a graph of nearly one billion edges in a distributed cloud cluster, as well as smaller graphs on a local cluster, with a short turnaround time. Moreover, our algorithms scale almost linearly with increasing numbers of virtual instances in the cloud. SORA is freely available for download at https://github.com/BioHPC/SORA/.},
booktitle = {2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
doi = {10.1109/BIBM.2018.8621546},
place = {United States},
year = {2018},
month = {12}
}
