OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Journal Article · Computer Physics Communications
 [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPUs, Intel Xeon Phi, and NVidia GPUs. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPUs and using shared memory on NVidia GPUs. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the optimized algorithms presented can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on an NVidia Tesla K20X GPU relative to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
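To make the contrast in the abstract concrete, the following is a minimal sketch of a naïve scattering transpose next to a tiled (blocked) traversal of the same dense tensor. It is not the paper's implementation and does not use TAL-SH: the function names (`strides_of`, `dest_strides`, `transpose_scatter`, `transpose_tiled`), the row-major flat-list storage, and the `tile` parameter are all assumptions made for illustration. Pure Python will not show the cache benefit; the sketch only shows how the access order changes while the result stays the same.

```python
from itertools import product


def strides_of(shape):
    """Row-major strides for a dense tensor of the given shape (assumed layout)."""
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def dest_strides(shape, perm):
    """Stride in the output buffer for each INPUT dimension.

    Output dimension n takes its extent from input dimension perm[n].
    """
    out_shape = [shape[p] for p in perm]
    out_strides = strides_of(out_shape)
    dst = [0] * len(shape)
    for new_d, old_d in enumerate(perm):
        dst[old_d] = out_strides[new_d]
    return out_shape, dst


def transpose_scatter(src, shape, perm):
    """Naive scatter: read src sequentially, write each element to its permuted offset."""
    out_shape, dst_stride = dest_strides(shape, perm)
    dst = [None] * len(src)
    # product() varies the last index fastest, which matches row-major offsets.
    for offset, idx in enumerate(product(*(range(n) for n in shape))):
        dst[sum(i * s for i, s in zip(idx, dst_stride))] = src[offset]
    return dst, out_shape


def transpose_tiled(src, shape, perm, tile=4):
    """Blocked traversal: copy one small tile at a time so that both the reads
    and the writes stay within a limited address range (illustrative only)."""
    out_shape, dst_stride = dest_strides(shape, perm)
    src_stride = strides_of(shape)
    dst = [None] * len(src)
    tile_starts = [range(0, n, tile) for n in shape]
    for start in product(*tile_starts):
        # Clip each tile at the tensor boundary.
        block = [range(s, min(s + tile, n)) for s, n in zip(start, shape)]
        for idx in product(*block):
            s_off = sum(i * st for i, st in zip(idx, src_stride))
            d_off = sum(i * st for i, st in zip(idx, dst_stride))
            dst[d_off] = src[s_off]
    return dst, out_shape
```

For example, `transpose_tiled(list(range(24)), (2, 3, 4), (2, 0, 1))` returns a buffer of shape `[4, 2, 3]` in which element `[k][i][j]` equals input element `[i][j][k]`, identical to what `transpose_scatter` produces. In a compiled language the tile size would be chosen to fit the cache (or GPU shared memory), which is the kind of memory access optimization the abstract refers to.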

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1185465
Alternate ID(s):
OSTI ID: 1246981
Journal Information:
Computer Physics Communications, Vol. 189; ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 36 works
Citation information provided by Web of Science

Cited By (8)

Efficient Tensor Sensing for RF Tomographic Imaging on GPUs journal February 2019
Gillespie’s Stochastic Simulation Algorithm on MIC coprocessors journal June 2016
Exact diagonalization of quantum lattice models on coprocessors text January 2015
Design of a high-performance GEMM-like Tensor-Tensor Multiplication preprint January 2016
TTC: A Tensor Transposition Compiler for Multiple Architectures text January 2016
HPTT: A High-Performance Tensor Transposition C++ Library preprint January 2017
Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation text January 2019
The landscape of software for tensor computations preprint January 2021

Similar Records

Quantum Monte Carlo Endstation for Petascale Computing
Technical Report · March 2, 2011 · OSTI ID: 1185465

Investigation of Portable Event-Based Monte Carlo Transport Using the NVIDIA Thrust Library
Journal Article · June 15, 2016 · Transactions of the American Nuclear Society · OSTI ID: 1185465

Distributed out-of-memory NMF on CPU/GPU architectures
Journal Article · September 8, 2023 · Journal of Supercomputing · OSTI ID: 1185465