U.S. Department of Energy
Office of Scientific and Technical Information

Distributed-Memory Sparse Deep Neural Network Inference Using Global Arrays

Conference

Partitioned Global Address Space (PGAS) models show tremendous promise for developing efficient and productive distributed-memory parallel applications. They have been used extensively in scientific computing because they offer a "shared-memory"-like model and convenient interfaces that separate communication from synchronization. Traditionally, PGAS communication models have been applied to dense, contiguously distributed data, but most modern applications exhibit varied levels of sparsity. Existing PGAS models require adaptations to support distributed sparse computations, since the associated computations often involve matrix arithmetic in addition to data movement. The Global Arrays toolkit from Pacific Northwest National Laboratory (PNNL) is one of the earliest PGAS models to combine one-sided data communication with distributed matrix operations, and it is still used in the popular NWChem quantum chemistry suite. Recently, we have extended the Global Arrays toolkit to support common sparse operations, such as sparse matrix-dense matrix multiplication (SpMM), sparse matrix-sparse matrix multiplication (SpGEMM), and sampled dense-dense matrix multiplication (SDDMM). These operations are the bedrock of sparse Deep Learning (DL); sparse deep neural networks and Graph Neural Networks (GNNs) have recently gained increasing attention for achieving speedups in training and inference with reduced memory footprints. Unlike scientific applications in High Performance Computing (HPC), modern distributed-memory-capable DL toolkits often rely on non-standardized and closed-source vendor software optimizations, creating challenges for software-hardware co-design at scale. Our goal is to support a variety of distributed-memory sparse matrix operations and helper functions in the newly created Sparse Global Arrays (SGA), making it possible to build portable and productive Machine Learning scenarios for algorithm/software and hardware co-design. Contemporary data-parallel schemes for training and inference are undergoing a major overhaul, since model replication limits scalability and causes resource inefficiencies. We have therefore adopted tensor parallelism to decompose both the model and the inputs and mitigate memory pressure. The current implementation is built on top of MPI and uses CPUs to maximize portability across platforms.
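To make the tensor-parallel SpMM pattern described above concrete, the following is a minimal, hypothetical sketch using mpi4py and SciPy rather than the Sparse Global Arrays (SGA) API (whose interface is not shown in this record). Each layer's sparse weight matrix is decomposed into row blocks across MPI ranks, every rank computes a local sparse partial product against its slice of the activations, and an Allreduce combines the partial results before the bias and ReLU are applied. The function name layer_forward, the 1-D row-block decomposition, and all sizes are illustrative assumptions, not SGA's actual interface.

# Hypothetical sketch of tensor-parallel sparse DNN inference over MPI.
# Illustrates the communication pattern only; it does not use Global Arrays/SGA.
import numpy as np
import scipy.sparse as sp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

def layer_forward(Y_local, W_rowblock, bias):
    """One tensor-parallel sparse layer: Y_next = ReLU(Y @ W + bias).

    Y_local    : dense (batch x f_in/nprocs) activation slice held by this rank
    W_rowblock : sparse CSR (f_in/nprocs x f_out) row block of the layer weights
    bias       : dense (f_out,) bias vector, replicated on every rank
    """
    # Local partial product over this rank's slice of the input features.
    # The sparse operand is kept on the left so SciPy performs the SpMM.
    partial = np.ascontiguousarray((W_rowblock.T @ Y_local.T).T)
    full = np.empty_like(partial)
    comm.Allreduce(partial, full, op=MPI.SUM)   # sum partial products across ranks
    full = np.maximum(full + bias, 0.0)         # bias + ReLU
    # Re-slice columns so the output is distributed the same way as the input.
    cols = np.array_split(np.arange(full.shape[1]), nprocs)[rank]
    return np.ascontiguousarray(full[:, cols])

# Toy driver: two layers with random sparse weights and a batch of 4 samples.
rng = np.random.default_rng(0)
f_in = f_out = 8 * nprocs
batch = 4
Y_local = rng.random((batch, f_in // nprocs))
for layer in range(2):
    W_rowblock = sp.random(f_in // nprocs, f_out, density=0.2,
                           format="csr", random_state=layer * nprocs + rank)
    bias = np.zeros(f_out)
    Y_local = layer_forward(Y_local, W_rowblock, bias)
if rank == 0:
    print("rank 0 local output block shape:", Y_local.shape)

Run with, e.g., "mpiexec -n 4 python sparse_inference_sketch.py". The 1-D row-block decomposition keeps only one Allreduce per layer; other decompositions (2-D blocks, column blocks) trade that collective for different communication patterns.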

Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
2563594
Report Number(s):
PNNL-SA-203184
Country of Publication:
United States
Language:
English

Similar Records

Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale
Journal Article · May 2021 · Proceedings - IEEE International Parallel and Distributed Processing Symposium (IPDPS) · OSTI ID: 1817306

Partitioning Models for Scaling Parallel Sparse Matrix-Matrix Multiplication
Journal Article · January 2018 · ACM Transactions on Parallel Computing · OSTI ID: 1525287

UPC++ v1.0 Programmer’s Guide, Revision 2021.9.0
Technical Report · September 2021 · OSTI ID: 1823253