Title: pGraph: Efficient Parallel Construction of Large-Scale Protein Sequence Homology Graphs

Journal Article · IEEE Transactions on Parallel and Distributed Systems, 23(10):1923-1933
DOI: https://doi.org/10.1109/TPDS.2012.19 · OSTI ID: 1053372

Detecting sequence homology between protein sequences is a fundamental problem in computational molecular biology, with pervasive application in nearly all analyses that aim to structurally and functionally characterize protein molecules. While detecting the homology between two protein sequences is relatively inexpensive, detecting pairwise homology for a large number of protein sequences can become computationally prohibitive for modern inputs, often requiring millions of CPU hours. Yet there is currently no robust support for parallelizing this kernel. In this paper, we identify the key characteristics that make this problem particularly hard to parallelize, and then propose a new parallel algorithm suited to detecting homology on large data sets using distributed-memory parallel computers. Our method, called pGraph, is a novel hybrid between the hierarchical multiple-master/worker model and the producer-consumer model, and is designed to break the irregularities imposed by alignment computation and work generation. Experimental results show that pGraph achieves linear scaling on a 2,048-processor distributed-memory cluster for inputs ranging from 20,000 to 2,560,000 sequences. In addition to demonstrating strong scaling, we present an extensive report on the performance of the various system components and related parametric studies.
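
The producer-consumer half of the design named in the abstract can be illustrated with a small sketch. This is not the pGraph implementation: it uses threads on a single machine rather than a distributed-memory cluster, omits the master/worker hierarchy, and replaces real sequence alignment with a toy shared k-mer count. All names (`shared_kmers`, `producer`, `consumer`, `build_graph`) and all parameter values are assumptions made for illustration only.

```python
import queue
import threading


def shared_kmers(a, b, k=3):
    """Toy stand-in for alignment (assumption): count k-mers shared by a and b."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def producer(seqs, start, step, work_q):
    """Work generation: emit candidate pairs (i, j) for this producer's rows."""
    n = len(seqs)
    for i in range(start, n, step):
        for j in range(i + 1, n):
            work_q.put((i, j))  # blocks while the bounded queue is full


def consumer(seqs, work_q, edges, lock, threshold):
    """Alignment: score pairs until a None sentinel arrives; keep the good ones."""
    while True:
        item = work_q.get()
        if item is None:
            break
        i, j = item
        if shared_kmers(seqs[i], seqs[j]) >= threshold:
            with lock:
                edges.add((i, j))


def build_graph(seqs, n_producers=2, n_consumers=4, threshold=2):
    """Return the set of edges (i, j), i < j, whose toy score meets threshold."""
    # The bounded queue decouples pair generation from pair scoring.
    work_q = queue.Queue(maxsize=64)
    edges, lock = set(), threading.Lock()
    producers = [
        threading.Thread(target=producer, args=(seqs, p, n_producers, work_q))
        for p in range(n_producers)
    ]
    consumers = [
        threading.Thread(target=consumer,
                         args=(seqs, work_q, edges, lock, threshold))
        for _ in range(n_consumers)
    ]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        work_q.put(None)  # one sentinel per consumer signals end of work
    for t in consumers:
        t.join()
    return edges


if __name__ == "__main__":
    demo = ["MKVLAAGIV", "MKVLAAGLV", "QQQWWWEEE"]
    print(sorted(build_graph(demo)))
```

The bounded queue is the point of the sketch: pair generation and pair scoring take uneven amounts of time, and the buffer lets each side proceed without waiting on the other in lockstep.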

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States). Environmental Molecular Sciences Lab. (EMSL)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1053372
Report Number(s):
PNNL-SA-77189; 30994; KJ0403000
Journal Information:
Journal Name: IEEE Transactions on Parallel and Distributed Systems, 23(10):1923-1933
Country of Publication:
United States
Language:
English

Similar Records

A Scalable Parallel Algorithm for Large-Scale Protein Sequence Homology Detection
Conference · Sep 13, 2010 · OSTI ID: 1053372

Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale
Thesis/Dissertation · May 1, 2015 · OSTI ID: 1053372

A work stealing based approach for enabling scalable optimal sequence homology detection
Journal Article · May 1, 2015 · Journal of Parallel and Distributed Computing · OSTI ID: 1053372