Acceleration of GPU-based Krylov solvers via data transfer reduction

Anzt, Hartwig; Tomov, Stanimire; Luszczek, Piotr; Sawyer, William; Dongarra, Jack

doi:10.1177/1094342015580139

Abstract

Krylov subspace iterative solvers are often the method of choice when solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to reduce certain data communications, and hence fail to leverage the full potential of the accelerator. In this study, we target the acceleration of Krylov subspace iterative methods for graphics processing units, and show, in particular for the Biconjugate Gradient Stabilized solver, that significant improvement can be achieved by reformulating the method to reduce data communications through application-specific kernels instead of using the generic BLAS kernels, e.g. as provided by NVIDIA’s cuBLAS library, and by designing a graphics processing unit specific sparse matrix-vector product kernel that is able to more efficiently use the graphics processing unit’s computing power. Furthermore, we derive a model estimating the performance improvement, and use experimental data to validate the expected runtime savings. Finally, considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance Krylov subspace iterative methods accelerated by graphics processing units.
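The data-transfer argument in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation and does not reproduce their performance model: it uses plain Python loops as stand-ins for GPU kernels, and the function names (`dot`, `unfused_update`, `fused_update`) and the `traffic` counter are hypothetical. It takes the closing step of one BiCGSTAB iteration (omega = (t·s)/(t·t), x = x + alpha·p + omega·s, r = s - omega·t) and counts vector elements moved to or from main memory, once as a sequence of generic BLAS-1 style passes and once with the passes merged.

```python
# Illustrative sketch only: Python loops stand in for GPU kernels.
# All names and the traffic counter are assumptions, not taken from the paper.

def dot(a, b):
    """Plain left-to-right dot product."""
    acc = 0.0
    for ai, bi in zip(a, b):
        acc += ai * bi
    return acc


def unfused_update(x, p, s, t, alpha):
    """Generic BLAS-1 style: every operation is its own pass over memory."""
    n = len(x)
    traffic = 0  # vector elements read from or written to main memory

    ts = dot(t, s)                               # reads t and s
    traffic += 2 * n
    tt = dot(t, t)                               # reads t again
    traffic += n
    omega = ts / tt

    x = [x[i] + alpha * p[i] for i in range(n)]  # reads x, p; writes x
    traffic += 3 * n
    x = [x[i] + omega * s[i] for i in range(n)]  # reads x, s; writes x
    traffic += 3 * n
    r = [s[i] - omega * t[i] for i in range(n)]  # reads s, t; writes r
    traffic += 3 * n
    return x, r, omega, traffic


def fused_update(x, p, s, t, alpha):
    """Merged passes: one for both reductions, one for both vector updates."""
    n = len(x)
    traffic = 0

    ts = tt = 0.0
    for i in range(n):                           # t and s are read once
        ts += t[i] * s[i]
        tt += t[i] * t[i]
    traffic += 2 * n
    omega = ts / tt                              # global value needed before pass 2

    x_new = [0.0] * n
    r_new = [0.0] * n
    for i in range(n):                           # reads x, p, s, t; writes x, r
        x_new[i] = x[i] + alpha * p[i] + omega * s[i]
        r_new[i] = s[i] - omega * t[i]
    traffic += 6 * n
    return x_new, r_new, omega, traffic
```

Under these counting assumptions the unfused variant moves 12n vector elements and the fused variant 8n. If the step is memory-bandwidth bound, so that runtime is roughly proportional to the data moved, this suggests an estimated speedup of about 12/8 = 1.5 for this step alone. The two passes cannot be merged further because omega depends on a global reduction that must complete before the vector updates start.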

Authors:

Anzt, Hartwig [1]; Tomov, Stanimire [1]; Luszczek, Piotr [1]; Sawyer, William [2]; Dongarra, Jack [3]

[1] Univ. of Tennessee, Knoxville, TN (United States). Innovative Computing Lab.
[2] Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland)
[3] Univ. of Tennessee, Knoxville, TN (United States). Innovative Computing Lab.; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)

Publication Date: 2015-04-08

Research Org.: Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF); Russian Scientific Fund (Russian Federation)

Contributing Org.: Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland); Univ. of Manchester (United Kingdom)

OSTI Identifier: 1361293

Grant/Contract Number: SC0010042; ACI-1339822; N14-11-00190

Resource Type: Accepted Manuscript

Journal Name: International Journal of High Performance Computing Applications

Additional Journal Information: Journal Volume: 29; Journal Issue: 3; Journal ID: ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; Krylov Subspace Methods; Iterative Solvers; Sparse Linear Systems; Graphics Processing Units; BiCGSTAB

Citation Formats


Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, and Dongarra, Jack. Acceleration of GPU-based Krylov solvers via data transfer reduction. United States: N. p., 2015. Web. doi:10.1177/1094342015580139.


Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, & Dongarra, Jack. Acceleration of GPU-based Krylov solvers via data transfer reduction. United States. https://doi.org/10.1177/1094342015580139


Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, and Dongarra, Jack. 2015. "Acceleration of GPU-based Krylov solvers via data transfer reduction". United States. https://doi.org/10.1177/1094342015580139. https://www.osti.gov/servlets/purl/1361293.


                    
@article{osti_1361293,
  title        = {Acceleration of GPU-based Krylov solvers via data transfer reduction},
  author       = {Anzt, Hartwig and Tomov, Stanimire and Luszczek, Piotr and Sawyer, William and Dongarra, Jack},
  abstractNote = {Krylov subspace iterative solvers are often the method of choice when solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to reduce certain data communications, and hence fail to leverage the full potential of the accelerator. In this study, we target the acceleration of Krylov subspace iterative methods for graphics processing units, and show, in particular for the Biconjugate Gradient Stabilized solver, that significant improvement can be achieved by reformulating the method to reduce data communications through application-specific kernels instead of using the generic BLAS kernels, e.g. as provided by NVIDIA’s cuBLAS library, and by designing a graphics processing unit specific sparse matrix-vector product kernel that is able to more efficiently use the graphics processing unit’s computing power. Furthermore, we derive a model estimating the performance improvement, and use experimental data to validate the expected runtime savings. Finally, considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance Krylov subspace iterative methods accelerated by graphics processing units.},
  doi          = {10.1177/1094342015580139},
  journal      = {International Journal of High Performance Computing Applications},
  number       = 3,
  volume       = 29,
  place        = {United States},
  year         = {2015},
  month        = {4}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1177/1094342015580139

Citation Metrics:

Cited by: 11 works

Citation information provided by
Web of Science

Works referenced in this record:

Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
journal, November 2007

Buttari, Alfredo; Dongarra, Jack; Langou, Julie
The International Journal of High Performance Computing Applications, Vol. 21, Issue 4
DOI: 10.1177/1094342007084026

Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication
conference, May 2011

Buluç, Aydin; Williams, Samuel; Oliker, Leonid
2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2011.73

CPU and GPU Performance of Large Scale Numerical Simulations in Geophysics
book, January 2014

Dorostkar, Ali; Lukarski, Dimitar; Lund, Björn
Lecture Notes in Computer Science
DOI: 10.1007/978-3-319-14325-5_2

Iterative Methods for Sparse Linear Systems
book, January 2003

Saad, Yousef
DOI: 10.1137/1.9780898718003

Model-driven autotuning of sparse matrix-vector multiply on GPUs
journal, May 2010

Choi, Jee W.; Singh, Amik; Vuduc, Richard W.
ACM SIGPLAN Notices, Vol. 45, Issue 5
DOI: 10.1145/1837853.1693471

Improving the Performance of CA-GMRES on Multicores with Multiple GPUs
conference, May 2014

Yamazaki, Ichitaro; Anzt, Hartwig; Tomov, Stanimire
2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2014.48

Model-driven autotuning of sparse matrix-vector multiply on GPUs
conference, January 2010

Choi, Jee W.; Singh, Amik; Vuduc, Richard W.
Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '10
DOI: 10.1145/1693453.1693471

Accelerating scientific computations with mixed precision algorithms
journal, December 2009

Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack
Computer Physics Communications, Vol. 180, Issue 12
DOI: 10.1016/j.cpc.2008.11.005

Performance and Energy-Aware Characterization of the Sparse Matrix-Vector Multiplication on Multithreaded Architectures
conference, September 2014

Malossi, A. C. I.; Ineichen, Y.; Bekas, C.
2014 43rd International Conference on Parallel Processing Workshops (ICPPW)
DOI: 10.1109/icppw.2014.30

A Fan-In Algorithm for Distributed Sparse Numerical Factorization
journal, May 1990

Ashcraft, Cleve; Eisenstat, Stanley C.; Liu, Joseph W. H.
SIAM Journal on Scientific and Statistical Computing, Vol. 11, Issue 3
DOI: 10.1137/0911033

Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
book, January 2010

Monakov, Alexander; Lokhmotov, Anton; Avetisyan, Arutyun
High Performance Embedded Architectures and Compilers
DOI: 10.1007/978-3-642-11515-8_10

GPU-accelerated preconditioned iterative linear solvers
journal, October 2012

Li, Ruipeng; Saad, Yousef
The Journal of Supercomputing, Vol. 63, Issue 2
DOI: 10.1007/s11227-012-0825-3

Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems
journal, March 1992

van der Vorst, H. A.
SIAM Journal on Scientific and Statistical Computing, Vol. 13, Issue 2
DOI: 10.1137/0913035

Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
book, January 1994

Barrett, Richard; Berry, Michael; Chan, Tony F.
DOI: 10.1137/1.9781611971538

Optimizing Krylov Subspace Solvers on Graphics Processing Units
conference, May 2014

Anzt, Hartwig; Sawyer, William; Tomov, Stanimire
2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/ipdpsw.2014.107

Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs
conference, October 2013

Aliaga, Jose I.; Perez, Joaquin; Quintana-Orti, Enrique S.
2013 42nd International Conference on Parallel Processing (ICPP)
DOI: 10.1109/ICPP.2013.41

Methods of conjugate gradients for solving linear systems
journal, December 1952

Hestenes, M. R.; Stiefel, E.
Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6
DOI: 10.6028/jres.049.044

Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
conference, January 2015

Anzt, Hartwig; Tomov, Stanimire; Dongarra, Jack
Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '15
DOI: 10.1145/2712386.2712387

Finite elements
book, April 2020

Hinch, E. J.
Think Before You Compute
DOI: 10.1017/9781108855297.008

Finite Elements
book, January 2008

Numerical Approximation Methods for Elliptic Boundary Value Problems
DOI: 10.1007/978-0-387-68805-3_9

Finite Elements
book, January 2008

Gekeler, Eckart W.
Mathematical Methods for Mechanics
DOI: 10.1007/978-3-540-69279-9_9

Sparse Matrix-Vector Multiplication on Multicore and Accelerators
book, December 2010

Williams, Samuel; Bell, Nathan; Choi, Jee Whan
Scientific Computing with Multicore and Accelerators
DOI: 10.1201/b10376-15

Finite Elements
book, January 1980

Pian, Theodore H. H.
Variational Methods in the Mechanics of Solids
DOI: 10.1016/b978-0-08-024728-1.50031-8

Accelerating Scientific Computations with Mixed Precision Algorithms
text, January 2008

Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack
arXiv
DOI: 10.48550/arxiv.0808.2794

Works referencing / citing this record:

A review of CUDA optimization techniques and tools for structured grid computing
journal, July 2019

Al-Mouhamed, Mayez A.; Khan, Ayaz H.; Mohammad, Nazeeruddin
Computing, Vol. 102, Issue 4
DOI: 10.1007/s00607-019-00744-1

Similar Records in DOE PAGES and OSTI.GOV collections:

Batched matrix computations on hardware accelerators based on GPUs

Journal Article: Haidar, Azzam; Dong, Tingxing; Luszczek, Piotr; ... - International Journal of High Performance Computing Applications

Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for an effective approach to develop energy-efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This study, consequently, describes the development of the most common, one-sided [...]
Cited by: 32
https://doi.org/10.1177/1094342014567546
Full Text Available

Performance Portable Batched Sparse Linear Solvers

Journal Article: Liegeois, Kim; Rajamanickam, Sivasankaran; Berger-Vergiat, Luc - IEEE Transactions on Parallel and Distributed Systems

Solving a large number of small linear systems is increasingly becoming a bottleneck in computational science applications. While dense linear solvers for such systems have been studied before, batched sparse linear solvers are just starting to emerge. In this paper, we discuss algorithms for solving batched sparse linear systems and their implementation in the Kokkos Kernels library. The new algorithms are performance portable and map well to the hierarchical parallelism available in modern accelerator architectures. The sparse matrix vector product (SPMV) kernel is the main performance bottleneck of the Krylov solvers we implement in this work. The implementation of the batched [...]
https://doi.org/10.1109/TPDS.2023.3249110

A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

Journal Article: Aliaga, José I.; Alonso, Pedro; Badía, José M.; ... - Journal of Computational Physics

We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousands [...]
https://doi.org/10.1016/J.JCP.2016.01.007

Aztec user's guide. Version 1

Technical Report: Hutchinson, S A; Shadid, J N; Tuminaro, R S

Aztec is an iterative library that greatly simplifies the parallelization process when solving the linear systems of equations Ax = b where A is a user supplied n x n sparse matrix, b is a user supplied vector of length n and x is a vector of length n to be computed. Aztec is intended as a software tool for users who want to avoid cumbersome parallel programming details but who have large sparse linear systems which require an efficiently utilized parallel processing system. A collection of data transformation tools is provided that allows for easy creation of distributed sparse [...]
https://doi.org/10.2172/135550
Full Text Available

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster

Conference: Allada, Veerendra; Benjegerdes, Troy; Bode, Brett

Commodity clusters augmented with application accelerators are evolving as competitive high performance computing systems. The Graphical Processing Unit (GPU) with a very high arithmetic density and performance per price ratio is a good platform for the scientific application acceleration. In addition to the interconnect bottlenecks among the cluster compute nodes, the cost of memory copies between the host and the GPU device has to be carefully amortized to improve the overall efficiency of the application. Scientific applications also rely on efficient implementation of the Basic Linear Algebra Subprograms (BLAS), among which the General Matrix Multiply (GEMM) is considered as the [...]
Full Text Available
