OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Abstract

An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPUs and using shared memory on NVidia GPUs. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on the NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
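
For orientation, the baseline that the abstract compares against (a scattering transpose with no memory access optimization) can be sketched roughly as below. This is only an illustrative sketch, not the paper's or TAL-SH's implementation: the function names `strides_row_major` and `naive_scatter_transpose`, the row-major storage layout, and the use of plain Python lists are all assumptions made here for readability. The input is read contiguously and each element is written to its permuted offset, so the writes are strided, which is the access pattern the optimized algorithms aim to improve.

```python
from itertools import product
from math import prod


def strides_row_major(shape):
    """Element strides of a dense tensor stored in row-major (C) order.

    Row-major layout is an assumption of this sketch.
    """
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    return strides


def naive_scatter_transpose(src, shape, perm):
    """Permute the dimensions of a flat dense tensor (hypothetical helper).

    Output dimension k corresponds to input dimension perm[k]. Elements are
    read in storage order and scattered to their destination offsets, with
    no blocking or other memory access optimization.
    """
    out_shape = [shape[p] for p in perm]
    out_strides = strides_row_major(out_shape)

    # Output stride associated with each *input* dimension.
    scatter = [0] * len(shape)
    for k, p in enumerate(perm):
        scatter[p] = out_strides[k]

    dst = [None] * prod(shape)
    # product() varies the last index fastest, matching row-major order,
    # so the enumeration counter is the source offset.
    for src_off, idx in enumerate(product(*(range(n) for n in shape))):
        dst_off = sum(i * s for i, s in zip(idx, scatter))
        dst[dst_off] = src[src_off]
    return dst, out_shape


if __name__ == "__main__":
    # Transpose a 2 x 3 matrix stored as a flat list.
    flat, new_shape = naive_scatter_transpose(list(range(6)), [2, 3], [1, 0])
    print(new_shape, flat)
```

In a contraction-to-matrix-multiplication workflow, a routine of this kind would be applied to bring the contracted indices of each tensor together before the data is handed to a matrix multiplication, which is why its cost matters for higher-dimensional tensors.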

Authors:
 Lyakh, Dmitry I. [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1185465
Alternate Identifier(s):
OSTI ID: 1246981
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Computer Physics Communications
Additional Journal Information:
Journal Volume: 189; Journal ID: ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; tensor transpose; array reordering; tensor contraction; many-body theory; electronic structure; multireference; NVidia GPU; Intel Xeon Phi

Citation Formats

Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States: N. p., 2015. Web. doi:10.1016/j.cpc.2014.12.013.
Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States. https://doi.org/10.1016/j.cpc.2014.12.013
Lyakh, Dmitry I. 2015. "An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU". United States. https://doi.org/10.1016/j.cpc.2014.12.013. https://www.osti.gov/servlets/purl/1185465.
@article{osti_1185465,
title = {An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU},
author = {Lyakh, Dmitry I.},
abstractNote = {An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPUs and using shared memory on NVidia GPUs. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on the NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).},
doi = {10.1016/j.cpc.2014.12.013},
url = {https://www.osti.gov/biblio/1185465},
journal = {Computer Physics Communications},
issn = {0010-4655},
number = {},
volume = 189,
place = {United States},
year = {2015},
month = {1}
}

Citation Metrics:
Cited by: 12 works
Citation information provided by
Web of Science
