An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
Abstract
An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
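As an illustration of the terms used in the abstract, the following minimal Python sketch contrasts a naïve scattering transpose with a generic tiled variant for a dense tensor. It is not the algorithm of the paper and not the TAL-SH interface: the function names (`strides_for`, `naive_transpose`, `tiled_transpose`), the `tile` parameter, and the storage convention (first index varies fastest) are all assumptions made for this example. In pure Python the tiling gives no real speedup; it only shows the memory access pattern that a cache-aware implementation would aim for.

```python
from itertools import product


def strides_for(dims):
    """Strides of a dense tensor stored with the first index fastest (assumed layout)."""
    strides, step = [], 1
    for d in dims:
        strides.append(step)
        step *= d
    return strides


def _scatter_strides(dims, perm):
    """For each input dimension, its stride in the transposed output.

    Output dimension k is input dimension perm[k].
    """
    out_dims = [dims[p] for p in perm]
    out_strides = strides_for(out_dims)
    scatter = [0] * len(dims)
    for k, p in enumerate(perm):
        scatter[p] = out_strides[k]
    return out_dims, scatter


def naive_transpose(src, dims, perm):
    """Scattering transpose: read src sequentially, write to permuted offsets."""
    out_dims, scatter = _scatter_strides(dims, perm)
    dst = [None] * len(src)
    idx = [0] * len(dims)
    for offset in range(len(src)):
        dst[sum(i * s for i, s in zip(idx, scatter))] = src[offset]
        # Advance the multi-index, first index fastest.
        for d in range(len(dims)):
            idx[d] += 1
            if idx[d] < dims[d]:
                break
            idx[d] = 0
    return dst, out_dims


def tiled_transpose(src, dims, perm, tile=8):
    """Generic tiled variant (hypothetical): block over the dimension that is
    fastest in the input and the dimension that is fastest in the output."""
    out_dims, scatter = _scatter_strides(dims, perm)
    in_strides = strides_for(dims)
    a, b = 0, perm[0]  # fastest input dim, fastest output dim
    if a == b:
        # Both sides are already contiguous along the same dimension.
        return naive_transpose(src, dims, perm)
    rest = [d for d in range(len(dims)) if d not in (a, b)]
    dst = [None] * len(src)
    for outer in product(*(range(dims[d]) for d in rest)):
        base_in = sum(i * in_strides[d] for i, d in zip(outer, rest))
        base_out = sum(i * scatter[d] for i, d in zip(outer, rest))
        for a0 in range(0, dims[a], tile):
            for b0 in range(0, dims[b], tile):
                # Within one tile, both reads and writes stay in a small window.
                for ib in range(b0, min(b0 + tile, dims[b])):
                    for ia in range(a0, min(a0 + tile, dims[a])):
                        dst[base_out + ia * scatter[a] + ib * scatter[b]] = \
                            src[base_in + ia * in_strides[a] + ib * in_strides[b]]
    return dst, out_dims
```

For example, `tiled_transpose(list(range(24)), [2, 3, 4], [2, 0, 1], tile=2)` returns the reordered data together with the output dimensions `[4, 2, 3]`, and it agrees element by element with `naive_transpose` on the same input.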
- Authors:
- Lyakh, Dmitry I.
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Publication Date:
- 2015-01-05
- Research Org.:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Org.:
- USDOE Office of Science (SC)
- OSTI Identifier:
- 1185465
- Alternate Identifier(s):
- OSTI ID: 1246981
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Computer Physics Communications
- Additional Journal Information:
- Journal Volume: 189; Journal ID: ISSN 0010-4655
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; tensor transpose; array reordering; tensor contraction; many-body theory; electronic structure; multireference; NVidia GPU; Intel Xeon Phi
Citation Formats
Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States: N. p., 2015. Web. doi:10.1016/j.cpc.2014.12.013.
Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States. https://doi.org/10.1016/j.cpc.2014.12.013
Lyakh, Dmitry I. 2015. "An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU". United States. https://doi.org/10.1016/j.cpc.2014.12.013. https://www.osti.gov/servlets/purl/1185465.
@article{osti_1185465,
title = {An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU},
author = {Lyakh, Dmitry I.},
abstractNote = {An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). A particular accent is made on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and 2-3 times speedup on NVidia Tesla K20X GPU with respect to the na{\"i}ve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).},
doi = {10.1016/j.cpc.2014.12.013},
journal = {Computer Physics Communications},
volume = 189,
place = {United States},
year = {2015},
month = jan
}
Works referenced in this record:
NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
journal, September 2010
- Valiev, M.; Bylaska, E. J.; Govind, N.
- Computer Physics Communications, Vol. 181, Issue 9, p. 1477-1489
Parallel implementation of electronic structure energy, gradient, and Hessian calculations
journal, May 2008
- Lotrich, V.; Flocke, N.; Ponton, M.
- The Journal of Chemical Physics, Vol. 128, Issue 19
Software design of ACES III with the super instruction architecture
journal, June 2011
- Deumens, Erik; Lotrich, Victor F.; Perera, Ajith
- Wiley Interdisciplinary Reviews: Computational Molecular Science, Vol. 1, Issue 6
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
journal, May 2006
- Nieplocha, Jarek; Palmer, Bruce; Tipparaju, Vinod
- The International Journal of High Performance Computing Applications, Vol. 20, Issue 2
NWChem: scalable parallel computational chemistry
journal, May 2011
- van Dam, H. J. J.; de Jong, W. A.; Bylaska, E.
- Wiley Interdisciplinary Reviews: Computational Molecular Science, Vol. 1, Issue 6
Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories
journal, November 2003
- Hirata, So
- The Journal of Physical Chemistry A, Vol. 107, Issue 46
Symbolic Algebra in Quantum Chemistry
journal, January 2006
- Hirata, So
- Theoretical Chemistry Accounts, Vol. 116, Issue 1-3
Automatic code generation for many-body electronic structure methods: the tensor contraction engine
journal, January 2006
- Auer, Alexander A.; Baumgartner, Gerald; Bernholdt, David E.
- Molecular Physics, Vol. 104, Issue 2
Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry
journal, November 2009
- Hartono, Albert; Lu, Qingda; Henretty, Thomas
- The Journal of Physical Chemistry A, Vol. 113, Issue 45
A framework for load balancing of tensor contraction expressions via dynamic task partitioning
conference, January 2013
- Lai, Pai-Wei; Stock, Kevin; Rajbhandari, Samyam
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
A Communication-Optimal Framework for Contracting Distributed Tensors
conference, November 2014
- Rajbhandari, Samyam; Nikam, Akshay; Lai, Pai-Wei
- SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster
journal, August 2010
- Hanrath, Michael; Engels-Putzka, Anna
- The Journal of Chemical Physics, Vol. 133, Issue 6
New implementation of high-level correlated methods using a general block tensor library for high-performance electronic structure calculations
journal, July 2013
- Epifanovsky, Evgeny; Wormit, Michael; Kuś, Tomasz
- Journal of Computational Chemistry, Vol. 34, Issue 26
An optimal index reshuffle algorithm for multidimensional arrays and its applications for parallel architectures
journal, March 2001
- Ding, C. H. Q.
- IEEE Transactions on Parallel and Distributed Systems, Vol. 12, Issue 3
A state‐selective multireference coupled‐cluster theory employing the single‐reference formalism
journal, August 1993
- Piecuch, Piotr; Oliphant, Nevin; Adamowicz, Ludwik
- The Journal of Chemical Physics, Vol. 99, Issue 3
New approach to the state-specific multireference coupled-cluster formalism
journal, June 2000
- Adamowicz, Ludwik; Malrieu, Jean-Paul; Ivanov, Vladimir V.
- The Journal of Chemical Physics, Vol. 112, Issue 23
Automated generation of coupled-cluster diagrams: Implementation in the multireference state-specific coupled-cluster approach with the complete-active-space reference
journal, January 2005
- Lyakh, Dmitry I.; Ivanov, Vladimir V.; Adamowicz, Ludwik
- The Journal of Chemical Physics, Vol. 122, Issue 2
Multireference State-Specific Coupled-Cluster Theory and Multiconfigurationality Index. BH Dissociation
journal, January 2005
- Ivanov, Vladimir V.; Adamowicz, Ludwik; Lyakh, Dmitry I.
- Collection of Czechoslovak Chemical Communications, Vol. 70, Issue 7
Multireference state-specific coupled-cluster methods. State-of-the-art and perspectives
journal, January 2009
- Ivanov, Vladimir V.; Lyakh, Dmitry I.; Adamowicz, Ludwik
- Physical Chemistry Chemical Physics, Vol. 11, Issue 14
An exponential multireference wave-function Ansatz
journal, August 2005
- Hanrath, Michael
- The Journal of Chemical Physics, Vol. 123, Issue 8
A fully simultaneously optimizing genetic approach to the highly excited coupled-cluster factorization problem
journal, March 2011
- Engels-Putzka, Anna; Hanrath, Michael
- The Journal of Chemical Physics, Vol. 134, Issue 12
A general state-selective multireference coupled-cluster algorithm
journal, July 2002
- Kállay, Mihály; Szalay, Péter G.; Surján, Péter R.
- The Journal of Chemical Physics, Vol. 117, Issue 3
Excitation Energies with Cost-Reduced Variant of the Active-Space EOMCCSDT Method: The EOMCCSDt-3̅ Approach
journal, October 2013
- Hu, Han-Shi; Kowalski, Karol
- Journal of Chemical Theory and Computation, Vol. 9, Issue 11
Multireference Nature of Chemistry: The Coupled-Cluster View
journal, December 2011
- Lyakh, Dmitry I.; Musiał, Monika; Lotrich, Victor F.
- Chemical Reviews, Vol. 112, Issue 1
An adaptive coupled-cluster theory: @CC approach
journal, December 2010
- Lyakh, Dmitry I.; Bartlett, Rodney J.
- The Journal of Chemical Physics, Vol. 133, Issue 24
Cache-oblivious algorithms
conference, January 1999
- Frigo, M.; Leiserson, C. E.; Prokop, H.
- 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039)
Works referencing / citing this record:
Efficient Tensor Sensing for RF Tomographic Imaging on GPUs
journal, February 2019
- Xu, Da; Zhang, Tao
- Future Internet, Vol. 11, Issue 2
Gillespie’s Stochastic Simulation Algorithm on MIC coprocessors
journal, June 2016
- Tangherloni, Andrea; Nobile, Marco S.; Cazzaniga, Paolo
- The Journal of Supercomputing, Vol. 73, Issue 2
Exact diagonalization of quantum lattice models on coprocessors
text, January 2015
- Siro, Topi; Harju, Ari
- arXiv
Design of a high-performance GEMM-like Tensor-Tensor Multiplication
preprint, January 2016
- Springer, Paul; Bientinesi, Paolo
- arXiv
TTC: A Tensor Transposition Compiler for Multiple Architectures
text, January 2016
- Springer, Paul; Sankaran, Aravind; Bientinesi, Paolo
- arXiv
HPTT: A High-Performance Tensor Transposition C++ Library
preprint, January 2017
- Springer, Paul; Su, Tong; Bientinesi, Paolo
- arXiv
Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation
text, January 2019
- Villalonga, Benjamin; Lyakh, Dmitry; Boixo, Sergio
- arXiv
The landscape of software for tensor computations
preprint, January 2021
- Psarras, Christos; Karlsson, Lars; Li, Jiajia
- arXiv