Thread-Level Parallelization and Optimization of NWChem for the Intel MIC Architecture

Shan, Hongzhang; Williams, Samuel; Jong, Wibe de; Oliker, Leonid

doi:10.2172/1163233

Technical Report · October 10, 2014

DOI: https://doi.org/10.2172/1163233 · OSTI ID: 1163233

In the multicore era, it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism, coupled with a reduced memory capacity, demands an altogether different approach. In this paper we explore augmenting two NWChem modules, the triples correction of CCSD(T) and Fock matrix construction, with OpenMP so that they can run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we use an existing MIC testbed at NERSC to evaluate our experiments. To proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while a straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient to attain high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65x better performance for the triples part of CCSD(T), due in large part to the fact that the limited on-card memory restricts the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6x better performance on Fock matrix construction when compared with the best MPI implementations running multiple processes per card.
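As a rough illustration of what applying OpenMP to a contraction loop nest can look like, the sketch below threads a plain dense contraction in C. It is not taken from NWChem (which is written largely in Fortran) and is not the report's CCSD(T) kernel: the function name `contract`, the row-major layout, and the choice of `collapse(2)` with a static schedule are all assumptions made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a tensor contraction:
 *   c[i][j] = sum over k of a[i][k] * b[k][j]
 * All arrays are dense and row-major. This is an illustrative
 * kernel, not code from NWChem or the report. */
void contract(int ni, int nj, int nk,
              const double *a, const double *b, double *c)
{
    /* Each (i, j) pair writes a distinct element of c, so the two outer
     * loops can be collapsed into one iteration space and divided among
     * threads without locks or atomics. Without OpenMP enabled, the
     * pragma is ignored and the loop runs serially. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < ni; i++) {
        for (int j = 0; j < nj; j++) {
            double sum = 0.0;  /* private to the thread running (i, j) */
            for (int k = 0; k < nk; k++) {
                sum += a[(size_t)i * nk + k] * b[(size_t)k * nj + j];
            }
            c[(size_t)i * nj + j] = sum;
        }
    }
}
```

With GCC or Clang this would typically be compiled with `-fopenmp`. Collapsing the outer loops is one common way to expose enough iterations to occupy the many hardware threads of a manycore card; whether it is the best choice depends on the loop bounds and memory layout of the real kernel.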

Research Organization: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Sponsoring Organization: Computational Research Division

DOE Contract Number: DE-AC02-05CH11231

OSTI ID: 1163233

Report Number(s): LBNL-6806E

Country of Publication: United States

Language: English

Similar Records

Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

Conference · January 1, 2015 · OSTI ID:1163233

Shan, Hongzhang; Williams, Samuel; de Jong, Wibe; +1 more

A Locality-Based Threading Algorithm for the Configuration-Interaction Method

Journal Article · July 3, 2017 · IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum · OSTI ID:1163233

Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; +1 more

High-performance epistasis detection in quantitative trait GWAS

Journal Article · July 12, 2016 · International Journal of High Performance Computing Applications · OSTI ID:1163233

Weeks, Nathan T.; Luecke, Glenn R.; Groth, Brandon M.; +5 more

Related Subjects

97 MATHEMATICS AND COMPUTING
NWChem
OpenMP
Xeon Phi
Optimization
