DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Journal Article · Computer Physics Communications
 [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPU and using shared memory on NVidia GPU. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
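The baseline that the abstract calls the naïve scattering algorithm can be sketched as follows. This is an illustrative sketch only, not the article's implementation or the TAL-SH API: the function names (`strides_row_major`, `transpose_scatter`, `transpose_2d_tiled`), the row-major flat-list layout, and the tile size are all assumptions made for the example. The second function shows the general idea of loop tiling for the two-dimensional case, which is one common way to improve cache reuse; the article's optimized algorithm for higher-dimensional tensors may differ.

```python
from itertools import product


def strides_row_major(shape):
    """Strides (in elements) of a dense row-major array with the given shape."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def transpose_scatter(src, shape, perm):
    """Naive scatter: read src contiguously, write each element to its permuted slot.

    Convention (assumed here): output dimension k is input dimension perm[k].
    Reads are sequential, but writes jump through dst with large strides.
    """
    out_shape = [shape[p] for p in perm]
    out_strides = strides_row_major(out_shape)
    # Stride in dst produced by one step along each *input* dimension.
    dst_stride_of_in = [0] * len(shape)
    for k, p in enumerate(perm):
        dst_stride_of_in[p] = out_strides[k]
    dst = [None] * len(src)
    # product() varies the last index fastest, matching row-major order of src.
    for src_off, idx in enumerate(product(*(range(n) for n in shape))):
        dst_off = sum(i * s for i, s in zip(idx, dst_stride_of_in))
        dst[dst_off] = src[src_off]
    return dst, out_shape


def transpose_2d_tiled(src, rows, cols, tile=32):
    """2D transpose visiting the matrix tile by tile (hypothetical tile size).

    Within one tile, both the reads and the writes stay inside a small
    memory footprint, which is the usual motivation for cache blocking.
    """
    dst = [None] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    dst[c * rows + r] = src[r * cols + c]
    return dst


if __name__ == "__main__":
    data = list(range(24))  # a 2 x 3 x 4 tensor stored row-major
    out, out_shape = transpose_scatter(data, [2, 3, 4], (2, 0, 1))
    print(out_shape, out[:6])
```

In pure Python the tiled loop gives no measurable cache benefit; the point is the loop structure, which carries over to compiled code where memory access order dominates the cost of a transpose.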

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
DOE Office of Science; USDOE
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1185465
Journal Information:
Computer Physics Communications, Vol. 189; ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

Cited By (9)

Gillespie’s Stochastic Simulation Algorithm on MIC coprocessors journal June 2016
Exact diagonalization of quantum lattice models on coprocessors text January 2015
Design of a high-performance GEMM-like Tensor-Tensor Multiplication preprint January 2016
TTC: A Tensor Transposition Compiler for Multiple Architectures text January 2016
HPTT: A High-Performance Tensor Transposition C++ Library preprint January 2017
Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation text January 2019
The landscape of software for tensor computations preprint January 2021
Efficient Tensor Sensing for RF Tomographic Imaging on GPUs journal February 2019