U.S. Department of Energy
Office of Scientific and Technical Information

A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

Journal Article · · International Journal of High Performance Computing Applications
 [1];  [2];  [2];  [2]
  1. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
  2. Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is publicly available on GitHub under a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization which makes use of compression methods to accelerate time to solution and reduce memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries such as cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs, and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable sized batched dense linear algebra operations such as GEMM, TRSM and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner runs on NVIDIA, AMD and Intel GPUs. Interfaces are available from PETSc, Trilinos and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster than the CPU-only solver. The BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA’s cuDSS solver.
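
As a rough illustration of the idea behind block low rank compression, the sketch below compresses a single dense off-diagonal tile into a product of two thin factors. It is not STRUMPACK code: it uses a plain truncated SVD in NumPy on the CPU instead of the adaptive randomized sampling that the abstract attributes to KBLAS, and the names `compress_tile` and `demo`, the test kernel, and the tolerance are all assumptions made for this example.

```python
import numpy as np

def compress_tile(tile, tol=1e-8):
    """Approximate a dense tile as U @ V using a truncated SVD.

    Singular values below tol * (largest singular value) are dropped.
    This stands in for the compression step; it is an illustrative
    choice, not the scheme used by any particular library.
    """
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    rank = int(np.sum(s > tol * s[0])) if s[0] > 0 else 0
    # Fold the singular values into the left factor.
    return u[:, :rank] * s[:rank], vt[:rank, :]

def demo(n=64, tol=1e-8):
    """Compress one tile that is expected to be numerically low rank."""
    # Hypothetical test tile: a smooth kernel evaluated between two
    # well-separated point sets, which typically has fast-decaying
    # singular values.
    x = np.linspace(0.0, 1.0, n)
    y = np.linspace(3.0, 4.0, n)
    tile = 1.0 / np.abs(x[:, None] - y[None, :])

    U, V = compress_tile(tile, tol)
    rel_err = np.linalg.norm(tile - U @ V) / np.linalg.norm(tile)
    return U.shape[1], rel_err

if __name__ == "__main__":
    rank, rel_err = demo()
    print(f"rank = {rank}, relative error = {rel_err:.2e}")
```

Storing the two factors takes 2·n·r numbers instead of n², so the saving grows as the numerical rank r shrinks relative to the tile size n. In a block low rank format, a step of this kind would be applied tile by tile to the off-diagonal blocks of a dense frontal matrix.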

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
2499469
Journal Information:
International Journal of High Performance Computing Applications, Vol. 39, Issue 1; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English

Similar Records

High performance sparse multifrontal solvers on modern GPUs
Journal Article · Fri Feb 04 23:00:00 EST 2022 · Parallel Computing · OSTI ID:1960514

MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures
Journal Article · Thu Jun 20 00:00:00 EDT 2024 · International Journal of High Performance Computing Applications · OSTI ID:2375895

Related Subjects