Thread-level parallelization and optimization of NWChem for the Intel MIC architecture

Shan, Hongzhang; Williams, Samuel; de Jong, Wibe; Oliker, Leonid

doi:10.1145/2712386.2712391

Conference · Thu Jan 01 04:00:00 EST 2015

DOI: https://doi.org/10.1145/2712386.2712391 · OSTI ID: 1407275

Shan, Hongzhang ^[1]; Williams, Samuel ^[1]; de Jong, Wibe ^[1]; Oliker, Leonid ^[1]

Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

In the multicore era it was possible to exploit the increase in on-chip parallelism by simply running multiple MPI processes per chip. Unfortunately, manycore processors' greatly increased thread- and data-level parallelism coupled with a reduced memory capacity demand an altogether different approach. In this paper we explore augmenting two NWChem modules, triples correction of the CCSD(T) and Fock matrix construction, with OpenMP in order that they might run efficiently on future manycore architectures. As the next NERSC machine will be a self-hosted Intel MIC (Xeon Phi) based supercomputer, we leverage an existing MIC testbed at NERSC to evaluate our experiments. To approximate the fact that future MIC machines will not have a host processor, we run all of our experiments in native mode. We found that while straightforward application of OpenMP to the deep loop nests associated with the tensor contractions of CCSD(T) was sufficient to attain high performance, significant effort was required to safely and efficiently thread the TEXAS integral package when constructing the Fock matrix. Ultimately, our new MPI+OpenMP hybrid implementations attain up to 65× better performance for the triples part of the CCSD(T), due in large part to the fact that the limited on-card memory restricts the existing MPI implementation to a single process per card. Additionally, we obtain up to 1.6× better performance on Fock matrix constructions when compared with the best MPI implementations running multiple processes per card.
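
As a rough illustration of what applying OpenMP to a deep tensor-contraction loop nest can look like, the sketch below threads a small dense contraction in C. It is not taken from NWChem (whose code is Fortran): the function name `contract_ijk`, the row-major array layout, and the choice of `collapse(2)` with a static schedule are all assumptions made for this example.

```c
#include <assert.h>

/* Hypothetical helper, not NWChem code:
 *   t[i][j][k] = sum over l of a[i][j][l] * b[l][k]
 * All arrays are dense and row-major (an assumption of this sketch). */
void contract_ijk(int ni, int nj, int nk, int nl,
                  const double *a, const double *b, double *t)
{
    assert(a && b && t);
    assert(ni > 0 && nj > 0 && nk > 0 && nl > 0);

    /* collapse(2) fuses the i and j loops into one iteration space so that
     * many threads get work even when ni alone is small. Each (i, j) pair
     * writes a disjoint slice of t, so no synchronization is needed.
     * The inputs a and b are shared read-only by all threads. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < ni; ++i) {
        for (int j = 0; j < nj; ++j) {
            for (int k = 0; k < nk; ++k) {
                double sum = 0.0; /* private to the thread running this (i, j) */
                for (int l = 0; l < nl; ++l) {
                    sum += a[((long)i * nj + j) * nl + l] * b[(long)l * nk + k];
                }
                t[((long)i * nj + j) * nk + k] = sum;
            }
        }
    }
}
```

When built with an OpenMP-enabled compiler (for example with `-fopenmp` on GCC), the threads share a single copy of `a` and `b`, whereas separate MPI processes would each hold their own copy. This shared-memory property is the kind of saving the abstract points to when on-card memory is limited. Without OpenMP enabled, the pragma is ignored and the loop runs serially with the same result.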

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

DOE Contract Number: AC02-05CH11231

OSTI ID: 1407275

Country of Publication: United States

Language: English

Related Subjects

CCSD(T)
Fock Matrix Construction
MPI
Manycore Architecture
NWChem
OpenMP
OpenMP Task
Performance
Texas Integral
Thread-level Parallelism
