Acceleration of GPU-based Krylov solvers via data transfer reduction
Abstract
Krylov subspace iterative solvers are often the method of choice when solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to reduce certain data communications, and hence fail to leverage the full potential of the accelerator. In this study, we target the acceleration of Krylov subspace iterative methods for GPUs, and in particular the Biconjugate Gradient Stabilized (BiCGSTAB) solver. We show that significant improvement can be achieved by reformulating the method to reduce data communications through application-specific kernels instead of using the generic BLAS kernels, e.g. as provided by NVIDIA's cuBLAS library, and by designing a GPU-specific sparse matrix-vector product kernel that is able to more efficiently use the GPU's computing power. Furthermore, we derive a model estimating the performance improvement, and use experimental data to validate the expected runtime savings. Finally, considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance GPU-accelerated Krylov subspace iterative methods.
 Authors:

 Univ. of Tennessee, Knoxville, TN (United States). Innovative Computing Lab.
 Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland)
 Univ. of Tennessee, Knoxville, TN (United States). Innovative Computing Lab.; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
 Publication Date:
 April 2015
 Research Org.:
 Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF); Russian Scientific Fund (Russian Federation)
 Contributing Org.:
 Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland); Univ. of Manchester (United Kingdom)
 OSTI Identifier:
 1361293
 Grant/Contract Number:
 SC0010042; ACI-1339822; N14-11-00190
 Resource Type:
 Accepted Manuscript
 Journal Name:
 International Journal of High Performance Computing Applications
 Additional Journal Information:
 Journal Volume: 29; Journal Issue: 3; Journal ID: ISSN 1094-3420
 Publisher:
 SAGE
 Country of Publication:
 United States
 Language:
 English
 Subject:
 97 MATHEMATICS AND COMPUTING; Krylov Subspace Methods; Iterative Solvers; Sparse Linear Systems; Graphics Processing Units; BiCGSTAB
Citation Formats
Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, and Dongarra, Jack. Acceleration of GPU-based Krylov solvers via data transfer reduction. United States: N. p., 2015.
Web. doi:10.1177/1094342015580139.
Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, & Dongarra, Jack. Acceleration of GPU-based Krylov solvers via data transfer reduction. United States. doi:10.1177/1094342015580139.
Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, and Dongarra, Jack. 2015.
"Acceleration of GPU-based Krylov solvers via data transfer reduction". United States. doi:10.1177/1094342015580139. https://www.osti.gov/servlets/purl/1361293.
@article{osti_1361293,
title = {Acceleration of GPU-based Krylov solvers via data transfer reduction},
author = {Anzt, Hartwig and Tomov, Stanimire and Luszczek, Piotr and Sawyer, William and Dongarra, Jack},
abstractNote = {Krylov subspace iterative solvers are often the method of choice when solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, as these libraries are usually composed of a well optimized but limited set of linear algebra operations, applications that use them often fail to reduce certain data communications, and hence fail to leverage the full potential of the accelerator. In this study, we target the acceleration of Krylov subspace iterative methods for GPUs, and in particular the Biconjugate Gradient Stabilized (BiCGSTAB) solver. We show that significant improvement can be achieved by reformulating the method to reduce data communications through application-specific kernels instead of using the generic BLAS kernels, e.g. as provided by NVIDIA's cuBLAS library, and by designing a GPU-specific sparse matrix-vector product kernel that is able to more efficiently use the GPU's computing power. Furthermore, we derive a model estimating the performance improvement, and use experimental data to validate the expected runtime savings. Finally, considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance GPU-accelerated Krylov subspace iterative methods.},
doi = {10.1177/1094342015580139},
journal = {International Journal of High Performance Computing Applications},
number = 3,
volume = 29,
place = {United States},
year = {2015},
month = {4}
}
Web of Science
Works referenced in this record:
Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
journal, November 2007
 Buttari, Alfredo; Dongarra, Jack; Langou, Julie
 The International Journal of High Performance Computing Applications, Vol. 21, Issue 4
Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication
conference, May 2011
 Buluç, Aydin; Williams, Samuel; Oliker, Leonid
 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
Model-driven autotuning of sparse matrix-vector multiply on GPUs
journal, May 2010
 Choi, Jee W.; Singh, Amik; Vuduc, Richard W.
 ACM SIGPLAN Notices, Vol. 45, Issue 5
Improving the Performance of CA-GMRES on Multicores with Multiple GPUs
conference, May 2014
 Yamazaki, Ichitaro; Anzt, Hartwig; Tomov, Stanimire
 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
Model-driven autotuning of sparse matrix-vector multiply on GPUs
conference, January 2010
 Choi, Jee W.; Singh, Amik; Vuduc, Richard W.
 Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '10)
Accelerating scientific computations with mixed precision algorithms
journal, December 2009
 Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack
 Computer Physics Communications, Vol. 180, Issue 12
Performance and Energy-Aware Characterization of the Sparse Matrix-Vector Multiplication on Multithreaded Architectures
conference, September 2014
 Malossi, A. C. I.; Ineichen, Y.; Bekas, C.
 2014 43rd International Conference on Parallel Processing Workshops (ICPPW)
A Fan-In Algorithm for Distributed Sparse Numerical Factorization
journal, May 1990
 Ashcraft, Cleve; Eisenstat, Stanley C.; Liu, Joseph W. H.
 SIAM Journal on Scientific and Statistical Computing, Vol. 11, Issue 3
GPU-accelerated preconditioned iterative linear solvers
journal, October 2012
 Li, Ruipeng; Saad, Yousef
 The Journal of Supercomputing, Vol. 63, Issue 2
Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems
journal, March 1992
 van der Vorst, H. A.
 SIAM Journal on Scientific and Statistical Computing, Vol. 13, Issue 2
Optimizing Krylov Subspace Solvers on Graphics Processing Units
conference, May 2014
 Anzt, Hartwig; Sawyer, William; Tomov, Stanimire
 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs
conference, October 2013
 Aliaga, Jose I.; Perez, Joaquin; Quintana-Orti, Enrique S.
 2013 42nd International Conference on Parallel Processing (ICPP)
Methods of conjugate gradients for solving linear systems
journal, December 1952
 Hestenes, M. R.; Stiefel, E.
 Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6
Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
conference, January 2015
 Anzt, Hartwig; Tomov, Stanimire; Dongarra, Jack
 Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores  PMAM '15
Works referencing / citing this record:
A review of CUDA optimization techniques and tools for structured grid computing
journal, July 2019
 Al-Mouhamed, Mayez A.; Khan, Ayaz H.; Mohammad, Nazeeruddin
 Computing, Vol. 102, Issue 4