A graphics processing unit accelerated sparse direct solver and preconditioner with block low rank compression

Claus, Lisa; Ghysels, Pieter; Boukaram, Wajih Halim; Li, Xiaoye Sherry

doi:10.1177/10943420241288567

Journal Article · September 30, 2024 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/10943420241288567 · OSTI ID: 2499469

Claus, Lisa [1]; Ghysels, Pieter [2]; Boukaram, Wajih Halim [2]; Li, Xiaoye Sherry [2]

[1] Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States). National Energy Research Scientific Computing Center (NERSC)
[2] Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

We present the GPU implementation efforts and challenges of the sparse solver package STRUMPACK. The code is publicly available on GitHub under a permissive BSD license. STRUMPACK implements an approximate multifrontal solver, a sparse LU factorization that uses compression methods to reduce time to solution and memory usage. Multiple compression schemes based on rank-structured and hierarchical matrix approximations are supported, including hierarchically semi-separable, hierarchically off-diagonal butterfly, and block low rank. In this paper, we present the GPU implementation of the block low rank (BLR) compression method within a multifrontal solver. Our GPU implementation relies on highly optimized vendor libraries: cuBLAS and cuSOLVER for NVIDIA GPUs, rocBLAS and rocSOLVER for AMD GPUs, and the Intel oneAPI Math Kernel Library (oneMKL) for Intel GPUs. Additionally, we rely on external open source libraries such as SLATE (Software for Linear Algebra Targeting Exascale), MAGMA (Matrix Algebra on GPU and Multi-core Architectures), and KBLAS (KAUST BLAS). SLATE is used as a GPU-capable ScaLAPACK replacement. From MAGMA we use variable sized batched dense linear algebra operations such as GEMM, TRSM, and LU with partial pivoting. KBLAS provides efficient (batched) low rank matrix compression for NVIDIA GPUs using an adaptive randomized sampling scheme. The resulting sparse solver and preconditioner runs on NVIDIA, AMD, and Intel GPUs. Interfaces are available from PETSc, Trilinos, and MFEM, or the solver can be used directly in user code. We report results for a range of benchmark applications, using the Perlmutter system from NERSC, Frontier from ORNL, and Aurora from ALCF. For a high frequency wave equation on a regular mesh, using 32 Perlmutter compute nodes, the factorization phase of the exact GPU solver is about 6.5× faster than that of the CPU-only solver, and the BLR-enabled GPU solver is about 13.8× faster than the CPU exact solver. For a collection of SuiteSparse matrices, the STRUMPACK exact factorization on a single GPU is on average 1.9× faster than NVIDIA’s cuDSS solver.
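
To make the block low rank idea in the abstract concrete, the following is a minimal NumPy sketch on the CPU. It is not STRUMPACK's implementation. As a simplifying assumption it compresses each off-diagonal tile with a truncated SVD, whereas the abstract describes adaptive randomized sampling through KBLAS on the GPU. The function names (`compress_tile`, `blr_compress`, `blr_matvec`, `blr_storage`), the tile size, the tolerance, and the test kernel are all hypothetical choices made for illustration.

```python
import numpy as np

def compress_tile(tile, tol):
    """Truncated SVD of one tile: return (U, V) with tile ~= U @ V."""
    u, s, vt = np.linalg.svd(tile, full_matrices=False)
    # Keep singular values above tol relative to the largest one.
    rank = int(np.sum(s > tol * s[0])) if s.size and s[0] > 0 else 0
    return u[:, :rank] * s[:rank], vt[:rank, :]

def blr_compress(a, tile_size, tol):
    """Tile a square matrix. Diagonal tiles stay dense; off-diagonal
    tiles are stored as low rank factors when that saves memory."""
    n = a.shape[0]
    starts = range(0, n, tile_size)
    tiles = {}
    for i in starts:
        for j in starts:
            block = a[i:i + tile_size, j:j + tile_size]
            if i == j:
                tiles[(i, j)] = ("dense", block.copy())
                continue
            u, v = compress_tile(block, tol)
            if u.size + v.size < block.size:
                tiles[(i, j)] = ("lowrank", (u, v))
            else:
                tiles[(i, j)] = ("dense", block.copy())
    return tiles

def blr_matvec(tiles, x):
    """Multiply the tiled matrix by a vector, tile by tile."""
    y = np.zeros_like(x, dtype=float)
    for (i, j), (kind, data) in tiles.items():
        if kind == "dense":
            rows, cols = data.shape
            y[i:i + rows] += data @ x[j:j + cols]
        else:
            u, v = data
            # Apply V first so the cost scales with the rank.
            y[i:i + u.shape[0]] += u @ (v @ x[j:j + v.shape[1]])
    return y

def blr_storage(tiles):
    """Count the floating point entries stored across all tiles."""
    total = 0
    for kind, data in tiles.values():
        total += data.size if kind == "dense" else data[0].size + data[1].size
    return total

# Hypothetical test matrix: a smooth kernel, so off-diagonal tiles
# have rapidly decaying singular values.
n = 256
pts = np.linspace(0.0, 1.0, n)
a = 1.0 / (1.0 + np.abs(pts[:, None] - pts[None, :]))

tiles = blr_compress(a, tile_size=32, tol=1e-8)
x = np.ones(n)
rel_err = np.linalg.norm(blr_matvec(tiles, x) - a @ x) / np.linalg.norm(a @ x)
print("stored entries:", blr_storage(tiles), "of", a.size)
print("relative matvec error:", rel_err)
```

In a multifrontal solver the same tiling would be applied to the dense frontal matrices during factorization. The tolerance trades accuracy for memory, which is why a compressed factorization can serve either as an approximate direct solver or as a preconditioner.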

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)

Grant/Contract Number: AC02-05CH11231

OSTI ID: 2499469

Journal Information: International Journal of High Performance Computing Applications, Vol. 39, Issue 1; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
