OSTI.GOV: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Enabling Graph Appliance for Genome Assembly

Abstract

In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotide sequences, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high-performance graph appliance with a large shared memory and a massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.

Authors:
Singh, Rina [1]; Graves, Jeffrey A [1]; Lee, Sangkeun [1]; Sukumar, Sreenivas R [1]; Shankar, Mallikarjun [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1224761
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE Big Data Workshop on Mining Big Data to Improve Clinical Effectiveness in conjunction with IEEE Big Data, Santa Clara, CA, USA, 2015-10-28 to 2015-11-02
Country of Publication:
United States
Language:
English
Subject:
Big data; data science; health informatics; genome assembly

Citation Formats

Singh, Rina, Graves, Jeffrey A, Lee, Sangkeun, Sukumar, Sreenivas R, and Shankar, Mallikarjun. Enabling Graph Appliance for Genome Assembly. United States: N. p., 2015. Web.
Singh, Rina, Graves, Jeffrey A, Lee, Sangkeun, Sukumar, Sreenivas R, & Shankar, Mallikarjun. Enabling Graph Appliance for Genome Assembly. United States.
Singh, Rina, Graves, Jeffrey A, Lee, Sangkeun, Sukumar, Sreenivas R, and Shankar, Mallikarjun. 2015. "Enabling Graph Appliance for Genome Assembly". United States. https://www.osti.gov/servlets/purl/1224761.
@article{osti_1224761,
title = {Enabling Graph Appliance for Genome Assembly},
author = {Singh, Rina and Graves, Jeffrey A and Lee, Sangkeun and Sukumar, Sreenivas R and Shankar, Mallikarjun},
abstractNote = {In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotide sequences, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to store and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray's Urika-GD system; Urika-GD is a high-performance graph appliance with a large shared memory and a massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing de Bruijn graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real-world Ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.},
doi = {},
url = {https://www.osti.gov/biblio/1224761},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2015},
month = {1}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
