Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures
- BATTELLE (PACIFIC NW LAB)
- Oak Ridge National Laboratory
- University of Sydney
Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. Because dependency information must be exchanged and shared among GPUs, SpTRSV calls for efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked by fast interconnects such as NVLink and NVSwitch. Instead, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space programming model, to enable efficient fine-grained communication and to drastically reduce synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model that further improves GPU utilization. With these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems show that our design achieves on average a 3.53x (up to 9.86x) speedup on a DGX-1 system and a 3.66x (up to 9.64x) speedup on a DGX-2 system with 4 GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
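To make the dependency structure mentioned in the abstract concrete, the following is a minimal single-threaded sketch of forward substitution for a lower-triangular system in CSR format. It is not the authors' multi-GPU implementation. The function names `sptrsv_lower_csr` and `dependency_levels`, the assumption that each row stores its diagonal entry last, and the 4x4 example matrix are all illustrative choices made here.

```python
def sptrsv_lower_csr(row_ptr, col_idx, vals, b):
    """Solve L x = b by forward substitution.

    Assumption (illustrative): L is lower triangular in CSR format and
    the diagonal entry is stored last in each row.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        start, end = row_ptr[i], row_ptr[i + 1]
        # Every off-diagonal entry needs an x[j] that was solved earlier;
        # this is the dependency that must be communicated across GPUs.
        for k in range(start, end - 1):
            acc -= vals[k] * x[col_idx[k]]
        x[i] = acc / vals[end - 1]
    return x


def dependency_levels(row_ptr, col_idx):
    """Return the dependency level of each row.

    Rows at the same level do not depend on each other and could be
    solved in parallel; a row's level is one more than the highest
    level among the rows it reads from.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            level[i] = max(level[i], level[col_idx[k]] + 1)
    return level


# Hypothetical example matrix:
#     [2 0 0 0]
# L = [0 4 0 0]
#     [1 0 1 0]
#     [0 2 3 5]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2.0, 4.0, 1.0, 1.0, 2.0, 3.0, 5.0]
b = [2.0, 8.0, 4.0, 18.0]

x = sptrsv_lower_csr(row_ptr, col_idx, vals, b)
levels = dependency_levels(row_ptr, col_idx)
print(x)
print(levels)
```

In this example rows 0 and 1 are independent, row 2 waits on row 0, and row 3 waits on rows 1 and 2. When rows are partitioned across devices, each such wait becomes a fine-grained remote read or notification, which is the communication pattern the abstract says unified memory handles poorly.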
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1830211
- Report Number(s): PNNL-SA-150878
- Country of Publication: United States
- Language: English
Similar Records
Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters
Conference · Wed Nov 01 00:00:00 EDT 2023 · OSTI ID:2438981
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
Journal Article · Tue Dec 31 23:00:00 EST 2019 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1598812
GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs
Journal Article · Fri Apr 01 00:00:00 EDT 2022 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1960228