ScalaBLAST: A Scalable Implementation of BLAST for High Performance Data-Intensive Bioinformatics Analysis
Journal Article · IEEE Transactions on Parallel and Distributed Systems, 17(8):740-749
Genes in an organism’s DNA (genome) have embedded in them information about proteins, which are the molecules that do most of a cell’s work. A typical bacterial genome contains on the order of 5000 genes. Mammalian genomes can contain hundreds of thousands of genes. For each genome sequenced, the challenge is to identify the protein components (proteome) being actively used under a given set of conditions. Fundamentally, sequence alignment is a sequence matching problem aimed at unlocking the protein information embedded in the genetic code, making it possible to assemble a “tree of life” by comparing new sequences against all sequences from known organisms. But the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high performance sequence alignment application, ScalaBLAST, which accommodates very large databases and scales linearly to hundreds of processors on both distributed memory and shared memory architectures. This represents a substantial improvement over the current state of the art in high performance sequence alignment in both scaling and portability. ScalaBLAST relies on a collection of innovative techniques to achieve high performance and scalability: distributing the target database over available memory, multi-level parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.
- Research Organization:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (US), Environmental Molecular Sciences Laboratory (EMSL)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-76RL01830
- OSTI ID:
- 889526
- Report Number(s):
- PNNL-SA-46431; 15490; KJ0101030
- Journal Information:
- Journal Name: IEEE Transactions on Parallel and Distributed Systems; Journal Volume: 17; Journal Issue: 8; Pages: 740-749
- Country of Publication:
- United States
- Language:
- English
Similar Records
Bringing large-scale multiple genome analysis one step closer: ScalaBLAST and beyond
Technical Report · Fri Jun 01 2007 · OSTI ID: 960403
High-throughput computation of pairwise sequence similarities for multiple genome comparison using ScalaBLAST
Conference · Thu May 01 2008 · OSTI ID: 935612
ScalaBLAST 2.0: Rapid and robust BLAST calculations on multiprocessor systems
Journal Article · Fri Mar 15 2013 · Bioinformatics, 29(6):797-8 · OSTI ID: 1072883