skip to main content

Title: Enabling Graph Appliance for Genome Assembly

In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotide, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray s Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or findingmore » Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing a de Bruin graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real world ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.« less
 [1] ;  [1] ;  [2] ;  [1] ;  [1]
  1. ORNL
  2. (Matt) [ORNL
Publication Date:
OSTI Identifier:
DOE Contract Number:
Resource Type:
Resource Relation:
Conference: IEEE Big Data Workshop on Mining Big Data to Improve Clinical Effectiveness in conjuction with IEEE Big Data, Santa Clara, CA, USA, 20151028, 20151102
Research Org:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
ORNL LDRD Director's R&D
Country of Publication:
United States
Big data; data science; health informatics; genome assembly