Title: Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

Abstract

In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.
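
As a rough illustration of what "straightforward application of OpenMP to the deep loop nests" of a tensor contraction can look like, the following is a minimal sketch and not code from NWChem or from the paper. The function name `contract`, the index layout, and the contraction itself (out[i][j] += sum over k,l of a[i][k][l] * b[k][l][j]) are all assumptions chosen for illustration.

```c
/* Hypothetical contraction, NOT taken from NWChem:
 *   out[i][j] += sum_{k,l} a[i][k][l] * b[k][l][j]
 * All tensors are dense, row-major, with every dimension equal to n. */
void contract(int n, const double *a, const double *b, double *out)
{
    /* collapse(2) merges the i and j loops into a single n*n iteration
     * space, which gives a manycore chip more parallel work than the
     * outer loop alone. Each (i, j) pair writes a distinct element of
     * out, so the threads do not race. Without OpenMP enabled, the
     * pragma is ignored and the loop nest runs serially. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;  /* private to each iteration */
            for (int k = 0; k < n; k++) {
                for (int l = 0; l < n; l++) {
                    sum += a[(i * n + k) * n + l] * b[(k * n + l) * n + j];
                }
            }
            out[i * n + j] += sum;
        }
    }
}
```

All threads in one process share `a`, `b`, and `out`, so the tensors are held once per card rather than once per MPI process. That shared footprint is the kind of memory saving the abstract credits for the hybrid version's advantage when on-card memory is limited.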

Authors:
 Shan, Hongzhang [1]; Williams, Samuel [1]; de Jong, Wibe [1]; Oliker, Leonid [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2015
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1407275
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 6th International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM 2015), San Francisco, CA (United States), 7-8 Feb 2015
Country of Publication:
United States
Language:
English
Subject:
OpenMP; OpenMP Task; MPI; Thread-level Parallelism; Performance; Manycore Architecture; NWChem; CCSD(T); Fock Matrix Construction; Texas Integral

Citation Formats

Shan, Hongzhang, Williams, Samuel, de Jong, Wibe, and Oliker, Leonid. Thread-level parallelization and optimization of NWChem for the Intel MIC architecture. United States: N. p., 2015. Web. doi:10.1145/2712386.2712391.
Shan, Hongzhang, Williams, Samuel, de Jong, Wibe, & Oliker, Leonid. Thread-level parallelization and optimization of NWChem for the Intel MIC architecture. United States. doi:10.1145/2712386.2712391.
Shan, Hongzhang, Williams, Samuel, de Jong, Wibe, and Oliker, Leonid. 2015. "Thread-level parallelization and optimization of NWChem for the Intel MIC architecture". United States. doi:10.1145/2712386.2712391. https://www.osti.gov/servlets/purl/1407275.
@article{osti_1407275,
title = {Thread-level parallelization and optimization of NWChem for the Intel MIC architecture},
author = {Shan, Hongzhang and Williams, Samuel and de Jong, Wibe and Oliker, Leonid},
abstractNote = {In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. In order to proxy the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient in attaining high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T) due in large part to the fact that the limited on-card memory limits the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.},
doi = {10.1145/2712386.2712391},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2015}
}
