OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Tensor Contraction and Operation Minimization for Extreme Scale Computational Chemistry

Technical Report · OSTI ID: 1782724
Authors: [1]; [2]
  1. RNET Technologies
  2. The University of Utah

During Phase 1, RNET and the University of Utah accomplished four main tasks to demonstrate the technical and practical feasibility of the ES-SciLA tensor library.

1. Evaluated SpMM runtime performance using the KokkosKernels framework, i.e., the multi-vector (2D tensor) KokkosSparse::spmv interface (a minimal usage sketch follows this list). KokkosKernels is a high-productivity library and an important part of the DOE exascale platform: it provides shared-memory performance portability across a range of hardware architectures and is a target for the planned future integration of ES-SciLA. The test cases include matrices from DARPA's SDNN Graph Challenge, a Transformer model [8], structural and power-law matrices from the SuiteSparse Matrix Collection [7], and a set of synthetic banded matrices. On these test cases, KokkosKernels achieves between 0.5 and 10 GFLOPS (see Figure 1), so there is significant scope to develop a performance-portable set of implementations and integrate them into KokkosKernels.

2. Designed and prototyped a Non-uniform Aligned Blocking (NAB) scheme that blocks a matrix based on its non-zero sparsity pattern (a hypothetical block descriptor is sketched after this list). NAB has several benefits for shared-memory linear algebra operations: the blocks are designed to maximize cache and register reuse; aligned blocks allow implementations that distribute row panels and require no atomics or reductions on the output matrix (e.g., our CPU NAB-based SpMM implementation); the blocking criteria are flexible (to support a range of algorithms); and the blocks naturally adapt to the structure of the matrix. In addition, the aligned-blocking concept can be used to develop distributed-memory partitioning schemes that balance load across processors (by tracking the expected number of multiply-add operations required per block pair) and minimize off-node memory accesses (by maximizing the size of the blocks used in each block-block operation). The number of non-zeros per active column in the blocks serves as a metric for the sparsity pattern: matrices whose blocks have few non-zeros per active column cannot leverage cache or shared memory effectively, so strategies that maximize output-register reuse and focus on non-cache optimizations are more likely to perform well. Note that our current implementations focus on data reuse and hence perform best for matrices with a high number of non-zeros per active column.

3. Implemented NAB-based operations that leverage the exposed dense blocks for SpMM (CPU and GPU) and SDDMM (CPU); a reference formulation of SDDMM appears after this list. The SpMM implementations outperform the vendor libraries (e.g., Intel's MKL and NVIDIA's cuSPARSE) by up to 80% or more for matrices with a sufficient number of non-zeros per active column, including the Transformer matrices, the high-bandwidth synthetic matrices, and several of the structural matrices (pwtk, cant, and pdb1HYS). The SDDMM implementation outperforms TACO (no vendor library is available) for these same cases by up to 5x. NAB also outperforms TACO for the structural matrices (2cubes, cant, pdb1HYS, and pwtk) by between 1.36x and 4.8x, and for the long-tail matrices (web-base, web-BerkStan, and web-NotreDame) by between 2x and 2.4x; for the remaining matrices the performance is qualitatively similar (within 10%).
4. Developed a distributed-memory partitioning plan using the aligned-blocking concepts; the partitioning will allow efficient distributed-memory implementations of the tensor operations. For instance, for sparse-sparse matrix multiplication, the aligned sparse blocks allow a load-balanced partitioning that maximizes locality in order to reduce the volume of data transferred between distributed-memory nodes (a sketch of a block-pair load metric follows this list).
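For reference, the multi-vector SpMM evaluation in task 1 uses the KokkosSparse::spmv interface. The following is a minimal, self-contained usage sketch; KokkosSparse::spmv and KokkosSparse::CrsMatrix are real KokkosKernels APIs, but the matrix contents, sizes, and typedefs are illustrative and assume a recent KokkosKernels release rather than the exact configuration used in this report.

    #include <Kokkos_Core.hpp>
    #include <KokkosSparse_CrsMatrix.hpp>
    #include <KokkosSparse_spmv.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        using device_t = Kokkos::DefaultExecutionSpace::device_type;
        using matrix_t = KokkosSparse::CrsMatrix<double, int, device_t>;

        // Tiny 3x3 CSR example: [[1,2,0],[0,3,0],[4,0,5]].
        const int nrows = 3, ncols = 3, nnz = 5, nvecs = 4;
        typename matrix_t::row_map_type::non_const_type rowmap("rowmap", nrows + 1);
        typename matrix_t::index_type::non_const_type   colidx("colidx", nnz);
        typename matrix_t::values_type::non_const_type  values("values", nnz);

        auto h_rowmap = Kokkos::create_mirror_view(rowmap);
        auto h_colidx = Kokkos::create_mirror_view(colidx);
        auto h_values = Kokkos::create_mirror_view(values);
        const int    rm[] = {0, 2, 3, 5};
        const int    ci[] = {0, 1, 1, 0, 2};
        const double vv[] = {1.0, 2.0, 3.0, 4.0, 5.0};
        for (int i = 0; i <= nrows; ++i) h_rowmap(i) = rm[i];
        for (int i = 0; i < nnz; ++i) { h_colidx(i) = ci[i]; h_values(i) = vv[i]; }
        Kokkos::deep_copy(rowmap, h_rowmap);
        Kokkos::deep_copy(colidx, h_colidx);
        Kokkos::deep_copy(values, h_values);

        matrix_t A("A", nrows, ncols, nnz, values, rowmap, colidx);

        // Rank-2 (multi-vector) Views turn spmv into an SpMM: Y = 1.0*A*X + 0.0*Y.
        Kokkos::View<double**, device_t> X("X", ncols, nvecs);
        Kokkos::View<double**, device_t> Y("Y", nrows, nvecs);
        Kokkos::deep_copy(X, 1.0);
        KokkosSparse::spmv("N", 1.0, A, X, 0.0, Y);
      }
      Kokkos::finalize();
      return 0;
    }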
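The report does not specify the NAB data layout, so the following block descriptor is a purely hypothetical sketch of how a non-uniform aligned block and the non-zeros-per-active-column metric from task 2 might be represented; every name in it is an illustrative assumption.

    #include <cstdint>
    #include <vector>

    // Hypothetical NAB block descriptor (illustrative only; not the report's layout).
    struct NabBlock {
      int32_t row_begin, row_end;        // aligned row-panel extent, shared across a panel
      int32_t col_begin, col_end;        // column extent chosen from the sparsity pattern
      int64_t nnz;                       // non-zeros falling inside this block
      std::vector<int32_t> active_cols;  // columns that contain at least one non-zero
      std::vector<double>  values;       // block-local values (storage layout unspecified)

      // The sparsity metric described in task 2: blocks with few non-zeros per
      // active column gain little from cache/shared-memory reuse, so register-
      // oriented strategies are preferred for them.
      double nnz_per_active_col() const {
        return active_cols.empty() ? 0.0
                                   : static_cast<double>(nnz) / active_cols.size();
      }
    };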
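The SDDMM operation from task 3 computes, for every non-zero (i, j) of a sparse sampling matrix S, the value s(i,j) * dot(A(i,:), B(j,:)), where A and B are dense. The reference loop below shows these semantics only; it is not the NAB implementation, and the CSR/row-major layouts are assumptions.

    #include <cstddef>
    #include <vector>

    // Reference SDDMM over a CSR sampling matrix S (n_rows x n_cols):
    //   out(i,j) = s(i,j) * dot(A(i,:), B(j,:)) for each non-zero (i,j) of S,
    // with A (n_rows x k) and B (n_cols x k) dense and row-major.
    void sddmm_csr(int n_rows, int k,
                   const std::vector<int>& row_ptr,    // size n_rows + 1
                   const std::vector<int>& col_idx,    // size nnz
                   const std::vector<double>& s_vals,  // size nnz
                   const std::vector<double>& A,       // n_rows * k
                   const std::vector<double>& B,       // n_cols * k
                   std::vector<double>& out)           // size nnz, pattern of S
    {
      out.assign(s_vals.size(), 0.0);
      for (int i = 0; i < n_rows; ++i) {
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
          const int j = col_idx[p];
          double dot = 0.0;
          for (int t = 0; t < k; ++t)
            dot += A[static_cast<std::size_t>(i) * k + t] *
                   B[static_cast<std::size_t>(j) * k + t];
          out[p] = s_vals[p] * dot;
        }
      }
    }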
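Finally, the load-balance metric mentioned for the distributed partitioning in task 4 (expected multiply-add operations per block pair) can be computed from per-block sparsity counts. The sketch below is an assumption about the bookkeeping, not the report's code: for each index c of the shared (contraction) dimension, every non-zero in column c of the left block pairs with every non-zero in row c of the right block.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Expected multiply-add count for one block-block product in sparse-sparse
    // matrix multiplication (illustrative). Inputs are per-index non-zero counts
    // over the shared dimension of the two aligned blocks.
    int64_t block_pair_madds(const std::vector<int>& left_nnz_per_col,
                             const std::vector<int>& right_nnz_per_row)
    {
      int64_t madds = 0;
      const std::size_t shared = left_nnz_per_col.size(); // == right_nnz_per_row.size()
      for (std::size_t c = 0; c < shared; ++c)
        madds += static_cast<int64_t>(left_nnz_per_col[c]) * right_nnz_per_row[c];
      return madds;
    }

A partitioner can then assign block pairs to nodes so that the per-node sums of this metric are balanced, while preferring large blocks to reduce off-node traffic.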

Research Organization:
RNET Technologies
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
SC0020616
OSTI ID:
1782724
Type / Phase:
SBIR (Phase I)
Report Number(s):
DOE-RNET-20616
Country of Publication:
United States
Language:
English

Similar Records

A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations
Journal Article · June 2017 · IEEE Transactions on Parallel and Distributed Systems

Fast sparse matrix-vector multiplication by exploiting variable block structure
Technical Report · July 2005

On the performance and energy efficiency of sparse linear algebra on GPUs
Journal Article · October 2016 · International Journal of High Performance Computing Applications