MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures

Abdelfattah, Ahmad; Beams, Natalie; Carson, Robert; Ghysels, Pieter; Kolev, Tzanio; Stitt, Thomas; Vargas, Arturo; Tomov, Stanimire; Dongarra, Jack

doi:10.1177/10943420241261960

Journal Article · June 20, 2024 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/10943420241261960 · OSTI ID: 2375895

Abdelfattah, Ahmad ^[1]; Beams, Natalie ^[1]; Carson, Robert ^[2]; Ghysels, Pieter ^[3]; Kolev, Tzanio ^[2]; Stitt, Thomas ^[2]; Vargas, Arturo ^[2]; Tomov, Stanimire ^[1]; Dongarra, Jack ^[1]

[1] Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, USA
[2] Lawrence Livermore National Laboratory, Livermore, CA, USA
[3] Lawrence Berkeley National Laboratory, Berkeley, CA, USA

MAGMA (Matrix Algebra for GPU and Multicore Architectures) is a pivotal open-source library in the landscape of GPU-enabled dense and sparse linear algebra computations. With a repertoire of approximately 750 numerical routines across four precisions, MAGMA is deeply ingrained in the DOE software stack, playing a crucial role in high-performance computing. Notable projects such as ExaConstit, HiOP, MARBL, and STRUMPACK, among others, directly harness the capabilities of MAGMA. In addition, the MAGMA development team has been acknowledged multiple times for contributing to the vendors’ numerical software stacks. Looking back over the time of the Exascale Computing Project (ECP), we highlight how MAGMA has adapted to recent changes in modern HPC systems, especially the growing gap between CPU and GPU compute capabilities, as well as the introduction of low precision arithmetic in modern GPUs. We also describe MAGMA’s direct impact on several ECP projects. Maintaining portable performance across NVIDIA and AMD GPUs, and with current efforts toward supporting Intel GPUs, MAGMA ensures its adaptability and relevance in the ever-evolving landscape of GPU architectures.

Research Organization: Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE; USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); USDOE Office of Science (SC), Basic Energy Sciences (BES). Scientific User Facilities (SUF)

Grant/Contract Number: AC02-06CH11357; AC52-07NA27344

OSTI ID: 2375895

Alternate ID(s): OSTI ID: 2429364

Report Number(s): LLNL--JRNL-860479; 10943420241261960

Journal Information: International Journal of High Performance Computing Applications; ISSN 1094-3420

Publisher: SAGE Publications

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
GPU computing
The MAGMA library
numerical linear algebra
performance portability
