skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

Abstract

The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore’s law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currentlymore » not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.« less

Authors:
 [1]
  1. Washington State Univ., Pullman, WA (United States)
Publication Date:
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1186981
Report Number(s):
PNNL-24266
TRN: US1601294
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Thesis/Dissertation
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; 59 BASIC BIOLOGICAL SCIENCES; PARALLEL PROCESSING; DATA ANALYSIS; ALIGNMENT; DETECTION; ALGORITHMS; VECTORS; AMINO ACID SEQUENCE; DNA; SUPERCOMPUTERS; PROTEINS; MATHEMATICAL SOLUTIONS; DESIGN; EFFICIENCY; EVALUATION; IMPLEMENTATION; PERFORMANCE; sequence alignment; smith waterman; work stealing; metagenomics

Citation Formats

Daily, Jeffrey A. Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale. United States: N. p., 2015. Web. doi:10.2172/1186981.
Daily, Jeffrey A. Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale. United States. doi:10.2172/1186981.
Daily, Jeffrey A. Fri . "Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale". United States. doi:10.2172/1186981. https://www.osti.gov/servlets/purl/1186981.
@article{osti_1186981,
title = {Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale},
author = {Daily, Jeffrey A.},
abstractNote = {The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore’s law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores, representing a time-to-solution of 33 seconds. We extend this work with a detailed analysis of single-node sequence alignment performance using the latest CPU vector instruction set extensions. Preliminary results reveal that current sequence alignment algorithms are unable to fully utilize widening vector registers.},
doi = {10.2172/1186981},
journal = {},
number = ,
volume = ,
place = {United States},
year = {2015},
month = {5}
}

Thesis/Dissertation:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this thesis or dissertation.

Save / Share: