DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Abstract

An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
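
To make the contrast in the abstract concrete, the following is a minimal sketch of a naïve scattering transpose next to a simple tiled variant. It is not the algorithm from the paper or the TAL-SH code: the function names, the row-major storage order, and the fixed tile size are all assumptions made for illustration. Pure Python will not show the cache effect in timings; the sketch only shows how the traversal order differs.

```python
from itertools import product


def row_major_strides(dims):
    """Element strides of a dense row-major array (storage order is an assumption)."""
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    return strides


def _dst_strides(dims, perm):
    """Stride in the output associated with each *input* dimension.

    Convention used here: output dimension k is input dimension perm[k].
    """
    out_strides = row_major_strides([dims[p] for p in perm])
    dst_stride_of = [0] * len(dims)
    for k, p in enumerate(perm):
        dst_stride_of[p] = out_strides[k]
    return dst_stride_of


def transpose_scatter(src, dims, perm):
    """Naive scatter: read src sequentially, write each element to its permuted offset."""
    dst_stride_of = _dst_strides(dims, perm)
    dst = [None] * len(src)
    # product() varies the last index fastest, matching row-major linear order.
    for lin, idx in enumerate(product(*(range(d) for d in dims))):
        dst[sum(i * s for i, s in zip(idx, dst_stride_of))] = src[lin]
    return dst


def transpose_blocked(src, dims, perm, tile=4):
    """Hypothetical tiled variant: walk the index space one small tile at a time,
    so that reads and writes both stay within a limited region of memory."""
    in_strides = row_major_strides(dims)
    dst_stride_of = _dst_strides(dims, perm)
    dst = [None] * len(src)
    for start in product(*(range(0, d, tile) for d in dims)):
        inner = [range(s, min(s + tile, d)) for s, d in zip(start, dims)]
        for idx in product(*inner):
            a = sum(i * s for i, s in zip(idx, in_strides))
            b = sum(i * s for i, s in zip(idx, dst_stride_of))
            dst[b] = src[a]
    return dst


dims, perm = (2, 3, 4), (2, 0, 1)
src = list(range(24))
scattered = transpose_scatter(src, dims, perm)
blocked = transpose_blocked(src, dims, perm, tile=2)
```

Both functions produce the same output array; only the order in which elements are visited changes. In a compiled implementation the tile size would be chosen so that a tile fits in cache (or in GPU shared memory), which is the kind of memory access optimization the abstract refers to.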

Authors:
 Lyakh, Dmitry I. [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
2015-01-05
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1185465
Alternate Identifier(s):
OSTI ID: 1246981
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Computer Physics Communications
Additional Journal Information:
Journal Volume: 189; Journal ID: ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; tensor transpose; array reordering; tensor contraction; many-body theory; electronic structure; multireference; NVidia GPU; Intel Xeon Phi

Citation Formats

Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States: N. p., 2015. Web. doi:10.1016/j.cpc.2014.12.013.
Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States. https://doi.org/10.1016/j.cpc.2014.12.013
Lyakh, Dmitry I. 2015. "An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU". United States. https://doi.org/10.1016/j.cpc.2014.12.013. https://www.osti.gov/servlets/purl/1185465.
@article{osti_1185465,
title = {An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU},
author = {Lyakh, Dmitry I.},
abstractNote = {An efficient parallel tensor transpose algorithm is suggested for shared-memory computing units, namely, multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. From the applied side, the ultimate goal is to minimize the overhead encountered in the transformation of tensor contractions into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order of magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).},
doi = {10.1016/j.cpc.2014.12.013},
journal = {Computer Physics Communications},
volume = 189,
place = {United States},
year = {2015},
month = {1}
}

Citation Metrics:
Cited by: 36 works
Citation information provided by
Web of Science

Works referenced in this record:

NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
journal, September 2010

  • Valiev, M.; Bylaska, E. J.; Govind, N.
  • Computer Physics Communications, Vol. 181, Issue 9, p. 1477-1489
  • DOI: 10.1016/j.cpc.2010.04.018

Parallel implementation of electronic structure energy, gradient, and Hessian calculations
journal, May 2008

  • Lotrich, V.; Flocke, N.; Ponton, M.
  • The Journal of Chemical Physics, Vol. 128, Issue 19
  • DOI: 10.1063/1.2920482

Software design of ACES III with the super instruction architecture
journal, June 2011

  • Deumens, Erik; Lotrich, Victor F.; Perera, Ajith
  • Wiley Interdisciplinary Reviews: Computational Molecular Science, Vol. 1, Issue 6
  • DOI: 10.1002/wcms.77

Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
journal, May 2006

  • Nieplocha, Jarek; Palmer, Bruce; Tipparaju, Vinod
  • The International Journal of High Performance Computing Applications, Vol. 20, Issue 2
  • DOI: 10.1177/1094342006064503

NWChem: scalable parallel computational chemistry
journal, May 2011

  • van Dam, H. J. J.; de Jong, W. A.; Bylaska, E.
  • Wiley Interdisciplinary Reviews: Computational Molecular Science, Vol. 1, Issue 6
  • DOI: 10.1002/wcms.62

Symbolic Algebra in Quantum Chemistry
journal, January 2006


Automatic code generation for many-body electronic structure methods: the tensor contraction engine
journal, January 2006

  • Auer, Alexander A.; Baumgartner, Gerald; Bernholdt, David E.
  • Molecular Physics, Vol. 104, Issue 2
  • DOI: 10.1080/00268970500275780

Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry
journal, November 2009

  • Hartono, Albert; Lu, Qingda; Henretty, Thomas
  • The Journal of Physical Chemistry A, Vol. 113, Issue 45
  • DOI: 10.1021/jp9051215

A framework for load balancing of tensor contraction expressions via dynamic task partitioning
conference, January 2013

  • Lai, Pai-Wei; Stock, Kevin; Rajbhandari, Samyam
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
  • DOI: 10.1145/2503210.2503290

A Communication-Optimal Framework for Contracting Distributed Tensors
conference, November 2014

  • Rajbhandari, Samyam; Nikam, Akshay; Lai, Pai-Wei
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.36

An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster
journal, August 2010

  • Hanrath, Michael; Engels-Putzka, Anna
  • The Journal of Chemical Physics, Vol. 133, Issue 6
  • DOI: 10.1063/1.3467878

New implementation of high-level correlated methods using a general block tensor library for high-performance electronic structure calculations
journal, July 2013

  • Epifanovsky, Evgeny; Wormit, Michael; Kuś, Tomasz
  • Journal of Computational Chemistry, Vol. 34, Issue 26
  • DOI: 10.1002/jcc.23377

An optimal index reshuffle algorithm for multidimensional arrays and its applications for parallel architectures
journal, March 2001

  • Ding, C. H. Q.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 12, Issue 3
  • DOI: 10.1109/71.914776

A state‐selective multireference coupled‐cluster theory employing the single‐reference formalism
journal, August 1993

  • Piecuch, Piotr; Oliphant, Nevin; Adamowicz, Ludwik
  • The Journal of Chemical Physics, Vol. 99, Issue 3
  • DOI: 10.1063/1.466179

New approach to the state-specific multireference coupled-cluster formalism
journal, June 2000

  • Adamowicz, Ludwik; Malrieu, Jean-Paul; Ivanov, Vladimir V.
  • The Journal of Chemical Physics, Vol. 112, Issue 23
  • DOI: 10.1063/1.481649

Automated generation of coupled-cluster diagrams: Implementation in the multireference state-specific coupled-cluster approach with the complete-active-space reference
journal, January 2005

  • Lyakh, Dmitry I.; Ivanov, Vladimir V.; Adamowicz, Ludwik
  • The Journal of Chemical Physics, Vol. 122, Issue 2
  • DOI: 10.1063/1.1824897

Multireference State-Specific Coupled-Cluster Theory and Multiconfigurationality Index. BH Dissociation
journal, January 2005

  • Ivanov, Vladimir V.; Adamowicz, Ludwik; Lyakh, Dmitry I.
  • Collection of Czechoslovak Chemical Communications, Vol. 70, Issue 7
  • DOI: 10.1135/cccc20051017

Multireference state-specific coupled-cluster methods. State-of-the-art and perspectives
journal, January 2009

  • Ivanov, Vladimir V.; Lyakh, Dmitry I.; Adamowicz, Ludwik
  • Physical Chemistry Chemical Physics, Vol. 11, Issue 14
  • DOI: 10.1039/b818590p

An exponential multireference wave-function Ansatz
journal, August 2005

  • Hanrath, Michael
  • The Journal of Chemical Physics, Vol. 123, Issue 8
  • DOI: 10.1063/1.1953407

A fully simultaneously optimizing genetic approach to the highly excited coupled-cluster factorization problem
journal, March 2011

  • Engels-Putzka, Anna; Hanrath, Michael
  • The Journal of Chemical Physics, Vol. 134, Issue 12
  • DOI: 10.1063/1.3561739

A general state-selective multireference coupled-cluster algorithm
journal, July 2002

  • Kállay, Mihály; Szalay, Péter G.; Surján, Péter R.
  • The Journal of Chemical Physics, Vol. 117, Issue 3
  • DOI: 10.1063/1.1483856

Excitation Energies with Cost-Reduced Variant of the Active-Space EOMCCSDT Method: The EOMCCSDt-3̅ Approach
journal, October 2013

  • Hu, Han-Shi; Kowalski, Karol
  • Journal of Chemical Theory and Computation, Vol. 9, Issue 11
  • DOI: 10.1021/ct400501z

Multireference Nature of Chemistry: The Coupled-Cluster View
journal, December 2011

  • Lyakh, Dmitry I.; Musiał, Monika; Lotrich, Victor F.
  • Chemical Reviews, Vol. 112, Issue 1
  • DOI: 10.1021/cr2001417

An adaptive coupled-cluster theory: @CC approach
journal, December 2010

  • Lyakh, Dmitry I.; Bartlett, Rodney J.
  • The Journal of Chemical Physics, Vol. 133, Issue 24
  • DOI: 10.1063/1.3515476

Cache-oblivious algorithms
conference, January 1999

  • Frigo, M.; Leiserson, C. E.; Prokop, H.
  • 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039)
  • DOI: 10.1109/SFFCS.1999.814600

Works referencing / citing this record:

Efficient Tensor Sensing for RF Tomographic Imaging on GPUs
journal, February 2019


Gillespie’s Stochastic Simulation Algorithm on MIC coprocessors
journal, June 2016

  • Tangherloni, Andrea; Nobile, Marco S.; Cazzaniga, Paolo
  • The Journal of Supercomputing, Vol. 73, Issue 2
  • DOI: 10.1007/s11227-016-1778-8

Design of a high-performance GEMM-like Tensor-Tensor Multiplication
preprint, January 2016


TTC: A Tensor Transposition Compiler for Multiple Architectures
text, January 2016


HPTT: A High-Performance Tensor Transposition C++ Library
preprint, January 2017


Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation
text, January 2019


The landscape of software for tensor computations
preprint, January 2021