An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU

Lyakh, Dmitry I.

doi:10.1016/j.cpc.2014.12.013

Abstract

An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPU and using shared memory on NVidia GPU. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).
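For orientation, the unoptimized baseline mentioned in the abstract can be sketched as follows. This is an illustrative reading of a "scattering" transpose, not code from the paper or from TAL-SH: the input is read contiguously and each element is written to its permuted position, so the writes are strided and cache-unfriendly. The function names, the row-major layout, and the convention that `perm[k]` names the input dimension that becomes output dimension `k` are all assumptions made for this example.

```python
from itertools import product


def strides_row_major(shape):
    """Return row-major (C-order) strides, in elements, for a dense tensor shape."""
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return strides


def naive_scatter_transpose(src, shape, perm):
    """Permute the dimensions of a dense tensor stored as a flat list.

    Hypothetical helper (assumed convention): perm[k] is the input
    dimension that becomes output dimension k. Reads are contiguous,
    writes are scattered, and no blocking or caching is attempted.
    """
    out_shape = [shape[p] for p in perm]
    out_strides = strides_row_major(out_shape)

    # Output stride associated with each *input* dimension.
    scatter_stride = [0] * len(shape)
    for k, p in enumerate(perm):
        scatter_stride[p] = out_strides[k]

    dst = [None] * len(src)
    # product() varies the last index fastest, which matches row-major
    # order, so the source offset simply counts up from zero.
    for src_offset, index in enumerate(product(*(range(n) for n in shape))):
        dst_offset = sum(i * s for i, s in zip(index, scatter_stride))
        dst[dst_offset] = src[src_offset]
    return dst, out_shape


# Usage: transpose a 2 x 3 matrix stored as a flat list.
flat, new_shape = naive_scatter_transpose([0, 1, 2, 3, 4, 5], [2, 3], [1, 0])
```

An optimized variant would process the tensor in blocks sized to fit cache (or GPU shared memory) so that both reads and writes touch contiguous runs; the sketch above deliberately omits that step.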

Authors:

Lyakh, Dmitry I. [1]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Publication Date: 2015-01-05

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)

Sponsoring Org.: USDOE Office of Science (SC)

OSTI Identifier: 1185465

Alternate Identifier(s): OSTI ID: 1246981

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: Computer Physics Communications

Additional Journal Information: Journal Volume: 189; Journal ID: ISSN 0010-4655

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; tensor transpose; array reordering; tensor contraction; many-body theory; electronic structure; multireference; NVidia GPU; Intel Xeon Phi

Citation Formats


Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States: N. p., 2015. Web. doi:10.1016/j.cpc.2014.12.013.



Lyakh, Dmitry I. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. United States. https://doi.org/10.1016/j.cpc.2014.12.013



Lyakh, Dmitry I. 2015. "An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU". United States. https://doi.org/10.1016/j.cpc.2014.12.013. https://www.osti.gov/servlets/purl/1185465.



                    
@article{osti_1185465,
  title        = {An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU},
  author       = {Lyakh, Dmitry I.},
  abstractNote = {An efficient parallel tensor transpose algorithm is proposed for shared-memory computing units, namely multicore CPU, Intel Xeon Phi, and NVidia GPU. The algorithm operates on dense tensors (multidimensional arrays) and is based on optimizing cache utilization on x86 CPU and using shared memory on NVidia GPU. On the applied side, the ultimate goal is to minimize the overhead incurred when tensor contractions are transformed into matrix multiplications in computer implementations of advanced methods of quantum many-body theory (e.g., in electronic structure theory and nuclear physics). Particular emphasis is placed on higher-dimensional tensors, which typically appear in the so-called multireference correlated methods of electronic structure theory. Depending on tensor dimensionality, the presented optimized algorithms can achieve an order-of-magnitude speedup on x86 CPUs and a 2-3 times speedup on NVidia Tesla K20X GPU with respect to the naïve scattering algorithm (no memory access optimization). Furthermore, the tensor transpose routines developed in this work have been incorporated into a general-purpose tensor algebra library (TAL-SH).},
  doi          = {10.1016/j.cpc.2014.12.013},
  journal      = {Computer Physics Communications},
  volume       = {189},
  place        = {United States},
  year         = {2015},
  month        = {1}
}


Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (Publisher)

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1016/j.cpc.2014.12.013


Citation Metrics:

Cited by: 36 works

Citation information provided by
Web of Science


Works referenced in this record:

NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
journal, September 2010

Valiev, M.; Bylaska, E. J.; Govind, N.
Computer Physics Communications, Vol. 181, Issue 9, p. 1477-1489
DOI: 10.1016/j.cpc.2010.04.018

Parallel implementation of electronic structure energy, gradient, and Hessian calculations
journal, May 2008

Lotrich, V.; Flocke, N.; Ponton, M.
The Journal of Chemical Physics, Vol. 128, Issue 19
DOI: 10.1063/1.2920482

Software design of ACES III with the super instruction architecture
journal, June 2011

Deumens, Erik; Lotrich, Victor F.; Perera, Ajith
Wiley Interdisciplinary Reviews: Computational Molecular Science, Vol. 1, Issue 6
DOI: 10.1002/wcms.77

Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
journal, May 2006

Nieplocha, Jarek; Palmer, Bruce; Tipparaju, Vinod
The International Journal of High Performance Computing Applications, Vol. 20, Issue 2
DOI: 10.1177/1094342006064503

NWChem: scalable parallel computational chemistry
journal, May 2011

van Dam, H. J. J.; de Jong, W. A.; Bylaska, E.
Wiley Interdisciplinary Reviews: Computational Molecular Science, Vol. 1, Issue 6
DOI: 10.1002/wcms.62

Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories
journal, November 2003

Hirata, So
The Journal of Physical Chemistry A, Vol. 107, Issue 46
DOI: 10.1021/jp034596z

Symbolic Algebra in Quantum Chemistry
journal, January 2006

Hirata, So
Theoretical Chemistry Accounts, Vol. 116, Issue 1-3
DOI: 10.1007/s00214-005-0029-5

Automatic code generation for many-body electronic structure methods: the tensor contraction engine
journal, January 2006

Auer, Alexander A.; Baumgartner, Gerald; Bernholdt, David E.
Molecular Physics, Vol. 104, Issue 2
DOI: 10.1080/00268970500275780

Performance Optimization of Tensor Contraction Expressions for Many-Body Methods in Quantum Chemistry
journal, November 2009

Hartono, Albert; Lu, Qingda; Henretty, Thomas
The Journal of Physical Chemistry A, Vol. 113, Issue 45
DOI: 10.1021/jp9051215

A framework for load balancing of tensor contraction expressions via dynamic task partitioning
conference, January 2013

Lai, Pai-Wei; Stock, Kevin; Rajbhandari, Samyam
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
DOI: 10.1145/2503210.2503290

A Communication-Optimal Framework for Contracting Distributed Tensors
conference, November 2014

Rajbhandari, Samyam; Nikam, Akshay; Lai, Pai-Wei
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2014.36

An efficient matrix-matrix multiplication based antisymmetric tensor contraction engine for general order coupled cluster
journal, August 2010

Hanrath, Michael; Engels-Putzka, Anna
The Journal of Chemical Physics, Vol. 133, Issue 6
DOI: 10.1063/1.3467878

New implementation of high-level correlated methods using a general block tensor library for high-performance electronic structure calculations
journal, July 2013

Epifanovsky, Evgeny; Wormit, Michael; Kuś, Tomasz
Journal of Computational Chemistry, Vol. 34, Issue 26
DOI: 10.1002/jcc.23377

An optimal index reshuffle algorithm for multidimensional arrays and its applications for parallel architectures
journal, March 2001

Ding, C. H. Q.
IEEE Transactions on Parallel and Distributed Systems, Vol. 12, Issue 3
DOI: 10.1109/71.914776

A state‐selective multireference coupled‐cluster theory employing the single‐reference formalism
journal, August 1993

Piecuch, Piotr; Oliphant, Nevin; Adamowicz, Ludwik
The Journal of Chemical Physics, Vol. 99, Issue 3
DOI: 10.1063/1.466179

New approach to the state-specific multireference coupled-cluster formalism
journal, June 2000

Adamowicz, Ludwik; Malrieu, Jean-Paul; Ivanov, Vladimir V.
The Journal of Chemical Physics, Vol. 112, Issue 23
DOI: 10.1063/1.481649

Automated generation of coupled-cluster diagrams: Implementation in the multireference state-specific coupled-cluster approach with the complete-active-space reference
journal, January 2005

Lyakh, Dmitry I.; Ivanov, Vladimir V.; Adamowicz, Ludwik
The Journal of Chemical Physics, Vol. 122, Issue 2
DOI: 10.1063/1.1824897

Multireference State-Specific Coupled-Cluster Theory and Multiconfigurationality Index. BH Dissociation
journal, January 2005

Ivanov, Vladimir V.; Adamowicz, Ludwik; Lyakh, Dmitry I.
Collection of Czechoslovak Chemical Communications, Vol. 70, Issue 7
DOI: 10.1135/cccc20051017

Multireference state-specific coupled-cluster methods. State-of-the-art and perspectives
journal, January 2009

Ivanov, Vladimir V.; Lyakh, Dmitry I.; Adamowicz, Ludwik
Physical Chemistry Chemical Physics, Vol. 11, Issue 14
DOI: 10.1039/b818590p

An exponential multireference wave-function Ansatz
journal, August 2005

Hanrath, Michael
The Journal of Chemical Physics, Vol. 123, Issue 8
DOI: 10.1063/1.1953407

A fully simultaneously optimizing genetic approach to the highly excited coupled-cluster factorization problem
journal, March 2011

Engels-Putzka, Anna; Hanrath, Michael
The Journal of Chemical Physics, Vol. 134, Issue 12
DOI: 10.1063/1.3561739

A general state-selective multireference coupled-cluster algorithm
journal, July 2002

Kállay, Mihály; Szalay, Péter G.; Surján, Péter R.
The Journal of Chemical Physics, Vol. 117, Issue 3
DOI: 10.1063/1.1483856

Excitation Energies with Cost-Reduced Variant of the Active-Space EOMCCSDT Method: The EOMCCSDt-3̅ Approach
journal, October 2013

Hu, Han-Shi; Kowalski, Karol
Journal of Chemical Theory and Computation, Vol. 9, Issue 11
DOI: 10.1021/ct400501z

Multireference Nature of Chemistry: The Coupled-Cluster View
journal, December 2011

Lyakh, Dmitry I.; Musiał, Monika; Lotrich, Victor F.
Chemical Reviews, Vol. 112, Issue 1
DOI: 10.1021/cr2001417

An adaptive coupled-cluster theory: @CC approach
journal, December 2010

Lyakh, Dmitry I.; Bartlett, Rodney J.
The Journal of Chemical Physics, Vol. 133, Issue 24
DOI: 10.1063/1.3515476

Cache-oblivious algorithms
conference, January 1999

Frigo, M.; Leiserson, C. E.; Prokop, H.
40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039)
DOI: 10.1109/SFFCS.1999.814600

Works referencing / citing this record:

Efficient Tensor Sensing for RF Tomographic Imaging on GPUs
journal, February 2019

Xu, Da; Zhang, Tao
Future Internet, Vol. 11, Issue 2
DOI: 10.3390/fi11020046

Gillespie’s Stochastic Simulation Algorithm on MIC coprocessors
journal, June 2016

Tangherloni, Andrea; Nobile, Marco S.; Cazzaniga, Paolo
The Journal of Supercomputing, Vol. 73, Issue 2
DOI: 10.1007/s11227-016-1778-8

Exact diagonalization of quantum lattice models on coprocessors
text, January 2015

Siro, Topi; Harju, Ari
arXiv
DOI: 10.48550/arxiv.1511.00863

Design of a high-performance GEMM-like Tensor-Tensor Multiplication
preprint, January 2016

Springer, Paul; Bientinesi, Paolo
arXiv
DOI: 10.48550/arxiv.1607.00145

TTC: A Tensor Transposition Compiler for Multiple Architectures
text, January 2016

Springer, Paul; Sankaran, Aravind; Bientinesi, Paolo
arXiv
DOI: 10.48550/arxiv.1607.01249

HPTT: A High-Performance Tensor Transposition C++ Library
preprint, January 2017

Springer, Paul; Su, Tong; Bientinesi, Paolo
arXiv
DOI: 10.48550/arxiv.1704.04374

Establishing the Quantum Supremacy Frontier with a 281 Pflop/s Simulation
text, January 2019

Villalonga, Benjamin; Lyakh, Dmitry; Boixo, Sergio
arXiv
DOI: 10.48550/arxiv.1905.00444

The landscape of software for tensor computations
preprint, January 2021

Psarras, Christos; Karlsson, Lars; Li, Jiajia
arXiv
DOI: 10.48550/arxiv.2103.13756

Similar Records in DOE PAGES and OSTI.GOV collections:

Quantum Monte Carlo Endstation for Petascale Computing
Technical Report
Ceperley, David

The major achievements enabled by QMC Endstation grant include
* Performance improvement on clusters of x86 multi-core systems, especially on Cray XT systems
* New and improved methods for the wavefunction optimizations
* New forms of trial wavefunctions
* Implementation of the full application on NVIDIA GPUs using CUDA

The scaling studies of QMCPACK on large-scale systems show excellent parallel efficiency up to 216K cores on Jaguarpf (Cray XT5). The GPU implementation shows speedups of 10-15x over the CPU implementation on older generation of x86. We have implemented hybrid OpenMP/MPI scheme in QMC to take advantage of multi-core shared memory ...
https://doi.org/10.2172/1007216

Investigation of Portable Event-Based Monte Carlo Transport Using the NVIDIA Thrust Library
Journal Article
Bleile, Ryan C.; Department of Computer and Information Science, University of Oregon, Eugene, OR 97403; Brantley, Patrick S.; ...
Transactions of the American Nuclear Society

Power consumption considerations are driving future high performance computing platforms toward many-core computing architectures. Los Alamos National Laboratory's Trinity machine, available in 2016, will use both Intel Xeon Haswell processors and Intel Xeon Phi Knights Landing many integrated core (MIC) architecture coprocessors. Lawrence Livermore National Laboratory's Sierra machine, available in 2018, will use an IBM PowerPC architecture along with Nvidia graphics processing unit (GPU) architecture accelerators. These different advanced architectures make the computing landscape in upcoming years complex. Traditional approaches to Monte Carlo transport do not work efficiently on these new computing platforms. MIC architectures require vectorization to operate efficiently, ...

Distributed out-of-memory NMF on CPU/GPU architectures
Journal Article
Boureima, Ismael; Bhattarai, Manish; Eren, Maksim; ...
Journal of Supercomputing

We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operation on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly ...
https://doi.org/10.1007/s11227-023-05587-4

Optimizing legacy molecular dynamics software with directive-based offload
Journal Article
Michael Brown, W.; Carrillo, Jan-Michael Y.; Gavhane, Nitin; ...
Computer Physics Communications

The directive-based programming models are one solution for exploiting many-core coprocessors to increase simulation rates in molecular dynamics. They offer the potential to reduce code complexity with offload models that can selectively target computations to run on the CPU, the coprocessor, or both. In our paper, we describe modifications to the LAMMPS molecular dynamics code to enable concurrent calculations on a CPU and coprocessor. We also demonstrate that standard molecular dynamics algorithms can run efficiently on both the CPU and an x86-based coprocessor using the same subroutines. As a consequence, we demonstrate that code optimizations for the coprocessor also result ...
Cited by 26
https://doi.org/10.1016/j.cpc.2015.05.004

Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing
Conference
Gawande, Nitin A.; Landwehr, Joshua B.; Daily, Jeffrey A.; ...

Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors (including NVIDIA, Intel, AMD, and IBM) have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating large DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these different products. This paper provides a performance and power analysis of important DL workloads on two major parallel architectures: NVIDIA DGX-1 (eight Pascal P100 GPUs interconnected with NVLink) and Intel Knights Landing (KNL) CPUs interconnected with Intel Omni-Path or Cray Aries. Our ...
https://doi.org/10.1109/IPDPSW.2017.36
