
Title: gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs

Journal Article · IEEE Transactions on Parallel and Distributed Systems
[1]; [2]; [1]
  1. Stevens Institute of Technology, Hoboken, NJ (United States)
  2. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Decomposing a matrix $\mathbf{A}$ into a lower triangular matrix $\mathbf{L}$ and an upper triangular matrix $\mathbf{U}$, known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the $\mathbf{L}$ and $\mathbf{U}$ factors than in the original matrix. A symbolic factorization step is needed to identify the nonzero structures of the $\mathbf{L}$ and $\mathbf{U}$ matrices. Attracted by the enormous potential of Graphics Processing Units (GPUs), an array of efforts has emerged to deploy the various LU factorization steps on GPUs, with the exception, to the best of our knowledge, of symbolic factorization. This article introduces gSoFa, the first GPU-based symbolic factorization design, with the following three optimizations to enable scalable LU symbolic factorization for nonsymmetric pattern sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited to the Single Instruction Multiple Thread (SIMT) architecture of GPUs. Second, we tailor supernode detection into a SIMT-friendly process and strive to balance the workload, minimize communication, and saturate the GPU computing resources during supernode detection. Third, we introduce a three-pronged optimization to reduce the excessive space consumption faced by multi-source concurrent symbolic factorization. Taken together, gSoFa achieves up to 31× speedup when scaling from 1 to 44 Summit nodes (6 to 264 GPUs) and outperforms the state-of-the-art CPU project by 5× on average. Notably, gSoFa also achieves up to 47 percent of the peak memory throughput of a V100 GPU in the Summit supercomputer.
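To make the notion of fill-in concrete, the following is a minimal sequential sketch of symbolic LU factorization on a nonzero pattern. It is not gSoFa's fine-grained GPU algorithm; it simply simulates Gaussian elimination on the structure, assuming no pivoting and a structurally nonzero diagonal. The function name `symbolic_lu` and the two small test patterns are hypothetical and chosen only for illustration.

```python
def symbolic_lu(n, pattern):
    """Return the nonzero patterns of L and U for an n x n sparse pattern.

    pattern is an iterable of (row, col) pairs of structural nonzeros.
    Assumptions (illustrative only): no pivoting, and every diagonal
    entry is treated as structurally nonzero.
    """
    # rows[i] holds the column indices of the nonzeros in row i.
    rows = [set() for _ in range(n)]
    for i, j in pattern:
        rows[i].add(j)
    for i in range(n):
        rows[i].add(i)

    # Eliminate pivots in order. Row k is final when it is used as a pivot,
    # so every later row i with a nonzero in column k inherits the
    # structure of row k to the right of the diagonal.
    for k in range(n):
        upper_k = {j for j in rows[k] if j > k}
        for i in range(k + 1, n):
            if k in rows[i]:
                rows[i] |= upper_k

    filled = {(i, j) for i in range(n) for j in rows[i]}
    lower = {(i, j) for (i, j) in filled if i >= j}
    upper = {(i, j) for (i, j) in filled if i <= j}
    return lower, upper


def arrow_pattern(n, hub):
    """Diagonal plus one dense row and one dense column at index hub."""
    entries = {(i, i) for i in range(n)}
    entries |= {(hub, j) for j in range(n)}
    entries |= {(i, hub) for i in range(n)}
    return entries


# Dense first row/column: eliminating pivot 0 fills the whole matrix.
bad = arrow_pattern(4, 0)
bad_lower, bad_upper = symbolic_lu(4, bad)
bad_fill = len(bad_lower | bad_upper) - len(bad)

# Dense last row/column: the same structure, reordered, creates no fill.
good = arrow_pattern(4, 3)
good_lower, good_upper = symbolic_lu(4, good)
good_fill = len(good_lower | good_upper) - len(good)
```

The two arrow patterns have the same number of nonzeros, yet the first produces fill-in while the second does not, which is why the structure of the factors has to be computed rather than read off the original matrix.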

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF); CAREER; Exascale Computing; USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC02-05CH11231; 2000722; 2046102; 17-SC-20-SC
OSTI ID:
1960228
Journal Information:
IEEE Transactions on Parallel and Distributed Systems, Vol. 33, Issue 4; ISSN 1045-9219
Publisher:
IEEE
Country of Publication:
United States
Language:
English
