Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

Daily, Jeffrey A.

doi:10.2172/1186981

Thesis/Dissertation · May 1, 2015

DOI: https://doi.org/10.2172/1186981 · OSTI ID: 1186981

Daily, Jeffrey A. [Washington State Univ., Pullman, WA (United States)]

The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. As a result, DNA and protein sequence repositories are being bombarded with new sequence information. Databases continue to report a Moore's law-like growth trajectory, roughly doubling in size every 18 months. In what seems to be a paradigm shift, individual projects are now capable of generating billions of raw sequences that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequence homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications, including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or "homologous") on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, deploying them at large scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed-memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.
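
As a rough serial illustration of the pipeline the abstract describes (an exact-matching filter followed by optimal pairwise alignment), the following minimal Python sketch may help. It is not the dissertation's implementation: the shared k-mer test stands in for whatever exact-matching filter the author uses, the scoring values (`match`, `mismatch`, `gap`), the linear gap penalty, and the function names and thresholds (`k`, `min_score`) are all assumptions chosen for illustration, and the distributed load balancing (work stealing, task counters) is omitted entirely.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_exact_match(a, b, k):
    """Hypothetical filter: True if a and b share an exact substring of length k."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal local alignment score (Smith-Waterman) with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the source.
    Only two rows of the dynamic programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def homologous_pairs(seqs, k=3, min_score=6):
    """Return (i, j, score) for pairs that pass the filter and the score cutoff."""
    pairs = []
    for i, j in combinations(range(len(seqs)), 2):
        # Skip the quadratic-time alignment when no exact k-mer is shared.
        if not shares_exact_match(seqs[i], seqs[j], k):
            continue
        score = smith_waterman_score(seqs[i], seqs[j])
        if score >= min_score:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    print(homologous_pairs(["ACGTACGT", "ACGTACGA", "TTTTTTTT"]))
```

Each surviving pair is an independent alignment task, which is what makes the workload amenable to dynamic load balancing across many cores; the filter only reduces how many such tasks are generated.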

Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-76RL01830

OSTI ID: 1186981

Report Number(s): PNNL-24266; TRN: US1601294

Country of Publication: United States

Language: English

Similar Records

A work stealing based approach for enabling scalable optimal sequence homology detection

Journal Article · May 1, 2015 · Journal of Parallel and Distributed Computing · OSTI ID:1186981

Daily, Jeffrey A.; Kalyanaraman, Anantharaman; Krishnamoorthy, Sriram; +1 more

Towards Scalable Optimal Sequence Homology Detection

Conference · December 26, 2012 · OSTI ID:1186981

Daily, Jeffrey A.; Krishnamoorthy, Sriram; Kalyanaraman, Anantharaman

FASTERp: A Feature Array Search Tool for Estimating Resemblance of Protein Sequences

Conference · March 14, 2014 · OSTI ID:1186981

Macklin, Derek; Egan, Rob; Wang, Zhong

Related Subjects

97 MATHEMATICS AND COMPUTING
59 BASIC BIOLOGICAL SCIENCES
PARALLEL PROCESSING
DATA ANALYSIS
ALIGNMENT
DETECTION
ALGORITHMS
VECTORS
AMINO ACID SEQUENCE
DNA
SUPERCOMPUTERS
PROTEINS
MATHEMATICAL SOLUTIONS
DESIGN
EFFICIENCY
EVALUATION
IMPLEMENTATION
PERFORMANCE
sequence alignment
Smith-Waterman
work stealing
metagenomics
