U.S. Department of Energy
Office of Scientific and Technical Information

Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters

Conference · OSTI ID:2438981
This paper presents a unified communication optimization framework for sparse triangular solve (SpTRSV) algorithms on CPU and GPU clusters. The framework builds upon a 3D communication-avoiding (CA) layout of Px × Py × Pz processes that divides a sparse matrix into Pz submatrices, each handled by a Px × Py 2D grid with block-cyclic distribution. We propose three communication optimization strategies. First, a new 3D SpTRSV algorithm is developed that trades inter-grid communication and synchronization for replicated computation. This design requires only one inter-grid synchronization, and the inter-grid communication is efficiently implemented with sparse allreduce operations. Second, broadcast and reduction communication trees are used to reduce the message latency of the intra-grid 2D communication on CPU clusters. Third, we leverage GPU-initiated one-sided communication to implement the communication trees on GPU clusters. With these nested inter- and intra-grid communication optimization strategies, the proposed 3D SpTRSV algorithm attains up to 3.45x speedups over the baseline 3D SpTRSV algorithm using up to 2048 Cori Haswell CPU cores. In addition, the proposed GPU 3D SpTRSV algorithm achieves up to 6.5x speedups over the proposed CPU 3D SpTRSV algorithm with Pz up to 64. Notably, the proposed GPU 3D SpTRSV scales to 256 GPUs on the Perlmutter system, whereas the existing 2D SpTRSV algorithm scales only up to 4 GPUs.
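As a rough illustration of two ideas named in the abstract, the sketch below shows how a 2D block-cyclic distribution might assign matrix blocks to a Px × Py grid, and why a tree-shaped broadcast can shorten the critical path compared with a root that sends to every process in turn. This is not the paper's implementation: the function names, the heap-ordered binary tree shape, and the rank numbering are all assumptions chosen for illustration.

```python
def block_owner(i, j, px, py):
    """Grid coordinates (row, col) of the process owning block (i, j)
    under a 2D block-cyclic distribution on a px-by-py grid.
    (Hypothetical helper; the mapping convention is an assumption.)"""
    return (i % px, j % py)


def binary_tree_children(rank, nprocs):
    """Children of `rank` in a heap-ordered binary tree rooted at rank 0.
    The binary shape is an assumption; the abstract only says 'trees'."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < nprocs]


def tree_depth(nprocs):
    """Number of hops from the root to the deepest rank in that tree."""
    return nprocs.bit_length() - 1 if nprocs > 0 else 0


def flat_broadcast_sends(nprocs):
    """Messages the root must send one after another without a tree."""
    return max(nprocs - 1, 0)


if __name__ == "__main__":
    # Block (5, 7) on a 2 x 3 grid lands on process (1, 1).
    print(block_owner(5, 7, 2, 3))
    # With 8 participating ranks, the root forwards to only 2 children
    # and the longest forwarding chain is 3 hops, versus 7 sends by the
    # root in a flat broadcast.
    print(binary_tree_children(0, 8), tree_depth(8), flat_broadcast_sends(8))
```

In this toy model the root's sequential send count drops from nprocs - 1 to at most 2, at the cost of extra forwarding hops, which is the general latency trade-off that communication trees are meant to exploit.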
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
2438981
Country of Publication:
United States
Language:
English

Similar Records

A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems
Journal Article · Sun Aug 18 20:00:00 EDT 2019 · Journal of Parallel and Distributed Computing · OSTI ID:1559632

Distributed out-of-memory NMF on CPU/GPU architectures
Journal Article · Thu Sep 07 20:00:00 EDT 2023 · Journal of Supercomputing · OSTI ID:2246858

Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures
Conference · Mon Aug 09 00:00:00 EDT 2021 · OSTI ID:1830211

Related Subjects