SciTech Connect

Title: Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases continue to report a Moore’s law-like growth trajectory, roughly doubling in size every 18 months. In what seems to be a paradigm shift, individual projects are now capable of generating billions of raw sequences that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequence homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications, including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment at large scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed-memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale.
Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.
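As a rough illustration of the two serial building blocks the abstract names, the following sketch pairs an exact-matching filter with an optimal local alignment score in the style of Smith-Waterman. It is not the dissertation's implementation: the function names, the scoring values (match=2, mismatch=-1, linear gap=-1), the k-mer length, and the score threshold are all assumptions chosen for readability, and the parallel parts (work stealing, task counters, data replication) are omitted entirely.

```python
from itertools import combinations


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Optimal local alignment score with a linear gap penalty.

    Scoring values are illustrative assumptions, not taken from the source.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shares_kmer(a, b, k):
    """Exact-matching filter: True if a and b share any substring of length k."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[j:j + k] in kmers for j in range(len(b) - k + 1))


def homologous_pairs(seqs, k=4, min_score=8):
    """Return (i, j, score) for pairs that pass the filter and the threshold.

    k and min_score are hypothetical parameters for this sketch.
    """
    hits = []
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        if not shares_kmer(a, b, k):
            continue  # skip the quadratic-time alignment for unlikely pairs
        score = smith_waterman_score(a, b)
        if score >= min_score:
            hits.append((i, j, score))
    return hits


if __name__ == "__main__":
    print(homologous_pairs(["ACGTACGT", "TTACGTACGG", "GGGGGGGG"]))
```

The filter matters because each alignment costs time proportional to the product of the two sequence lengths, so discarding pairs with no shared k-mer avoids most of that work. In a distributed setting, each surviving pair would presumably become an independent task for the load balancer.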
Authors:
 [1]
  1. Washington State Univ., Pullman, WA (United States)
Publication Date:
OSTI Identifier:
1186981
Report Number(s):
PNNL--24266
TRN: US1601294
DOE Contract Number:
AC05-76RL01830
Resource Type:
Thesis/Dissertation
Research Org:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org:
USDOE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; PARALLEL PROCESSING; DATA ANALYSIS; ALIGNMENT; DETECTION; ALGORITHMS; VECTORS; AMINO ACID SEQUENCE; DNA; SUPERCOMPUTERS; PROTEINS; MATHEMATICAL SOLUTIONS; DESIGN; EFFICIENCY; EVALUATION; IMPLEMENTATION; PERFORMANCE; sequence alignment; Smith-Waterman; work stealing; metagenomics