Extreme Scale De Novo Metagenome Assembly
- Intel Corporation, Santa Clara, CA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the underlying microbiome's genomes. State-of-the-art tools require large shared-memory machines and cannot handle contemporary metagenome datasets that exceed terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and can therefore run seamlessly on both shared-memory and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, which consists of 7.5 billion reads (2.6 TBytes in size).
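As a rough illustration of the de Bruijn graph idea mentioned in the abstract, the following toy sketch builds a graph from a few short reads for a single k-mer length and walks one unambiguous path to recover a contig. It is not MetaHipMer's implementation, which is iterative, distributed, and written in Unified Parallel C; the function names `build_de_bruijn` and `extend_contig`, the example reads, and the choice of k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from `start` while the path has no branches or merges."""
    indegree = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            indegree[s] += 1

    contig, node, seen = start, start, {start}
    # graph.get avoids creating empty entries in the defaultdict.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen or indegree[nxt] != 1:
            break  # stop at cycles and at nodes where paths merge
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free, overlapping reads from one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

Real metagenome data adds sequencing errors, repeats, and uneven coverage across species, so a production assembler needs error pruning and scaffolding steps that this sketch leaves out.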
- Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC)
- DOE Contract Number: AC02-05CH11231
- OSTI ID: 1581597
- Country of Publication: United States
- Language: English
Related Subjects
MetaHipMer leverages
MetaHipMer pipeline
MetaHipMer scales
distributed-memory systems
environmental samples
erroneous DNA segments
extreme scale de novo metagenome assembly
high-performance metagenome assembler
intractable grand challenge metagenomes
iterative de Bruijn graph approach
microbiomes
scaffolding algorithm
shared memory machines
unified parallel C language