An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on the optimization of cache utilization on x86 CPU and the use of shared memory on NVidia GPU. On the application side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors that typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naive scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
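To make the contrast in the abstract concrete, the following is a minimal sketch of a scattering transpose next to a tiled variant, written in plain Python. It is not the paper's implementation or the TAL-SH API: the function names (`strides`, `transpose_scatter`, `transpose_blocked`), the row-major layout, the `perm` convention, and the `tile` parameter are all assumptions made for illustration. Pure Python will not show the cache effects; the sketch only illustrates the memory access order that a cache-aware version might aim for.

```python
from itertools import product


def strides(dims):
    """Row-major strides (in elements) for a dense tensor of shape dims."""
    s = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        s[i] = s[i + 1] * dims[i + 1]
    return s


def _scatter_strides(dims, perm):
    """For each source dimension, its stride in the permuted output.

    Assumed convention: perm[k] is the source dimension that becomes
    output dimension k.
    """
    out_dims = [dims[p] for p in perm]
    out_str = strides(out_dims)
    scatter = [0] * len(dims)
    for k, p in enumerate(perm):
        scatter[p] = out_str[k]
    return out_dims, scatter


def transpose_scatter(src, dims, perm):
    """Naive scattering: read src contiguously, write to permuted offsets."""
    out_dims, scatter = _scatter_strides(dims, perm)
    dst = [None] * len(src)
    # product() enumerates indices in row-major order, so `offset`
    # is the linear source offset of `idx`.
    for offset, idx in enumerate(product(*(range(d) for d in dims))):
        dst[sum(i * s for i, s in zip(idx, scatter))] = src[offset]
    return dst, out_dims


def transpose_blocked(src, dims, perm, tile=4):
    """Tiled variant (illustrative assumption, not the published algorithm).

    Tiles are formed over the dimension that is fastest in the source and
    the dimension that is fastest in the destination, so each tile touches
    short contiguous runs on both sides.
    """
    n = len(dims)
    out_dims, scatter = _scatter_strides(dims, perm)
    in_str = strides(dims)
    a = n - 1       # fastest-varying source dimension
    b = perm[-1]    # source dimension that is fastest in the output
    if a == b:
        # Innermost dimension is contiguous on both sides; no tiling needed.
        return transpose_scatter(src, dims, perm)
    others = [d for d in range(n) if d not in (a, b)]
    dst = [None] * len(src)
    for outer in product(*(range(dims[d]) for d in others)):
        base_in = sum(i * in_str[d] for i, d in zip(outer, others))
        base_out = sum(i * scatter[d] for i, d in zip(outer, others))
        for a0 in range(0, dims[a], tile):
            for b0 in range(0, dims[b], tile):
                for ia in range(a0, min(a0 + tile, dims[a])):
                    for ib in range(b0, min(b0 + tile, dims[b])):
                        dst[base_out + ia * scatter[a] + ib * scatter[b]] = \
                            src[base_in + ia * in_str[a] + ib * in_str[b]]
    return dst, out_dims
```

For example, with `dims = (2, 3, 4)` and `perm = (2, 0, 1)`, both functions return a tensor of shape `[4, 2, 3]` in which output element `(k, i, j)` equals source element `(i, j, k)`. In a compiled implementation the tile size would presumably be chosen to fit the cache (or GPU shared memory); here it is just a free parameter.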
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- Grant/Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1185465
- Alternate ID(s):
- OSTI ID: 1246981
- Journal Information:
- Computer Physics Communications, Vol. 189; ISSN 0010-4655
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
Web of Science
Efficient Tensor Sensing for RF Tomographic Imaging on GPUs | journal | February 2019 |
Gillespie’s Stochastic Simulation Algorithm on MIC coprocessors | journal | June 2016 |
Exact diagonalization of quantum lattice models on coprocessors | text | January 2015 |
Design of a high-performance GEMM-like Tensor-Tensor Multiplication | preprint | January 2016 |
TTC: A Tensor Transposition Compiler for Multiple Architectures | text | January 2016 |
HPTT: A High-Performance Tensor Transposition C++ Library | preprint | January 2017 |
Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation | text | January 2019 |
The landscape of software for tensor computations | preprint | January 2021 |
Similar Records
Investigation of Portable Event-Based Monte Carlo Transport Using the NVIDIA Thrust Library
Distributed out-of-memory NMF on CPU/GPU architectures