Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends
Abstract
Coupled-cluster methods provide highly accurate models of molecular structure by explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular, and their parallelization has previously been achieved via dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed-memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the best optimized shared-memory implementation. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, BlueGene/Q) and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from compute-bound DGEMMs to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load imbalance. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.
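The abstract's remark that tensor contractions are based on matrix-matrix multiplication can be illustrated with a minimal sketch. It is not Libtensor code and ignores the symmetry and block-sparsity handling the report describes. The function names (`matmul`, `contract_via_matmul`, `contract_direct`) and the nested-list tensor layout are assumptions made for illustration only.

```python
def matmul(A, B):
    """Dense matrix product on nested lists; a stand-in for a DGEMM call."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][c] for k in range(inner)) for c in range(cols)]
            for row in A]


def contract_via_matmul(T, V):
    """Compute R[a][b][c][d] = sum over i, j of T[a][b][i][j] * V[i][j][c][d]."""
    na, nb, ni, nj = len(T), len(T[0]), len(T[0][0]), len(T[0][0][0])
    nc, nd = len(V[0][0]), len(V[0][0][0])
    # Flatten the free indices (a, b) into rows and the contracted (i, j) into columns.
    A = [[T[a][b][i][j] for i in range(ni) for j in range(nj)]
         for a in range(na) for b in range(nb)]
    # Flatten the contracted indices (i, j) into rows and the free (c, d) into columns.
    B = [[V[i][j][c][d] for c in range(nc) for d in range(nd)]
         for i in range(ni) for j in range(nj)]
    C = matmul(A, B)
    # Unflatten the matrix result back into a four-index tensor.
    return [[[[C[a * nb + b][c * nd + d] for d in range(nd)]
              for c in range(nc)] for b in range(nb)] for a in range(na)]


def contract_direct(T, V):
    """Reference contraction with explicit sums, used only to check the result."""
    na, nb, ni, nj = len(T), len(T[0]), len(T[0][0]), len(T[0][0][0])
    nc, nd = len(V[0][0]), len(V[0][0][0])
    return [[[[sum(T[a][b][i][j] * V[i][j][c][d]
                   for i in range(ni) for j in range(nj))
               for d in range(nd)] for c in range(nc)]
             for b in range(nb)] for a in range(na)]
```

Flattening groups of indices turns the contraction into a single matrix product, which is why dense contraction cost is dominated by DGEMM-like kernels. A production library would call an optimized BLAS routine rather than the pure-Python `matmul` used here.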
 Authors:

 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). Computational Research Division
 Q-Chem, Inc., Pleasanton, CA (United States)
 Univ. of Southern California, Los Angeles, CA (United States). Dept. of Chemistry
 Publication Date:
 Research Org.:
 Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
 OSTI Identifier:
 1274416
 Report Number(s):
 LBNL-1005853
ir:1005853
 DOE Contract Number:
 AC02-05CH11231
 Resource Type:
 Technical Report
 Country of Publication:
 United States
 Language:
 English
 Subject:
 74 ATOMIC AND MOLECULAR PHYSICS; 97 MATHEMATICS AND COMPUTING
Citation Formats
Ibrahim, Khaled Z., Epifanovsky, Evgeny, Williams, Samuel W., and Krylov, Anna I. Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends. United States: N. p., 2016.
Web. doi:10.2172/1274416.
Ibrahim, Khaled Z., Epifanovsky, Evgeny, Williams, Samuel W., & Krylov, Anna I. Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends. United States. https://doi.org/10.2172/1274416
Ibrahim, Khaled Z., Epifanovsky, Evgeny, Williams, Samuel W., and Krylov, Anna I. Tue .
"Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends". United States. https://doi.org/10.2172/1274416. https://www.osti.gov/servlets/purl/1274416.
@article{osti_1274416,
title = {Cross-scale Efficient Tensor Contractions for Coupled Cluster Computations Through Multiple Programming Model Backends},
author = {Ibrahim, Khaled Z. and Epifanovsky, Evgeny and Williams, Samuel W. and Krylov, Anna I.},
abstractNote = {Coupled-cluster methods provide highly accurate models of molecular structure by explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix-matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular, and their parallelization has previously been achieved via dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed-memory environment in a scalable and energy-efficient manner. We achieve up to 240× speedup compared with the best optimized shared-memory implementation. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, BlueGene/Q) and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from compute-bound DGEMMs to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load imbalance. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists.},
doi = {10.2172/1274416},
url = {https://www.osti.gov/biblio/1274416},
place = {United States},
year = {2016},
month = {7}
}