An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Journal Article · Mon Jan 05 04:00:00 UTC 2015 · Computer Physics Communications

DOI: https://doi.org/10.1016/j.cpc.2014.12.013 · OSTI ID:1185465

Lyakh, Dmitry I.

Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPU and using shared memory on NVidia GPU. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
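
For orientation, the scattering baseline mentioned in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the routine from the paper or from TAL-SH: the function names `strides_of` and `scatter_transpose`, the flat-list storage, and the column-major (first index fastest) layout are all choices made for this example. The source is read sequentially and each element is written to its permuted position, so the writes land at scattered offsets; no attempt is made to optimize memory access.

```python
def strides_of(dims):
    """Column-major strides (first index fastest); layout is an assumption."""
    strides, step = [], 1
    for d in dims:
        strides.append(step)
        step *= d
    return strides


def scatter_transpose(src, dims, perm):
    """Unoptimized scattering transpose of a dense tensor held in a flat list.

    perm[k] names the source dimension that becomes dimension k of the output.
    Hypothetical helper for illustration only.
    """
    out_dims = [dims[p] for p in perm]
    out_strides = strides_of(out_dims)
    dst = [None] * len(src)
    for src_off in range(len(src)):      # sequential reads
        rem, idx = src_off, []
        for d in dims:                   # flat offset -> multi-index
            idx.append(rem % d)
            rem //= d
        # permuted multi-index -> flat offset in the output
        dst_off = sum(idx[p] * s for p, s in zip(perm, out_strides))
        dst[dst_off] = src[src_off]      # scattered writes
    return dst, out_dims


# A 2 x 3 matrix stored column-major, transposed to 3 x 2.
b, b_dims = scatter_transpose([0, 1, 2, 3, 4, 5], [2, 3], [1, 0])
print(b, b_dims)
```

In this sketch every element costs one read and one write, but for a general permutation the write offsets jump by large strides, which is the kind of access pattern that the cache and shared-memory optimizations described in the abstract are meant to avoid.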

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: DOE Office of Science; USDOE

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1185465

Journal Information: Computer Physics Communications, Vol. 189; ISSN 0010-4655

Publisher: Elsevier

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
Intel Xeon Phi
NVidia GPU
array reordering
electronic structure
many-body theory
multireference
tensor contraction
tensor transpose
