DOE PAGES: U.S. Department of Energy
Office of Scientific and Technical Information

Title: Acceleration of GPU-based Krylov solvers via data transfer reduction

Abstract

Krylov subspace iterative solvers are often the method of choice when solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating-point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, because these libraries usually provide a well-optimized but limited set of linear algebra operations, applications that use them often fail to avoid certain data communication, and hence fail to leverage the full potential of the accelerator. In this study, we target the acceleration of Krylov subspace iterative methods on GPUs, and in particular the Biconjugate Gradient Stabilized (BiCGSTAB) solver. We show that significant improvement can be achieved by reformulating the method to reduce data communication through application-specific kernels instead of the generic BLAS kernels provided, e.g., by NVIDIA’s cuBLAS library, and by designing a GPU-specific sparse matrix-vector product kernel that uses the GPU’s computing power more efficiently. Furthermore, we derive a model estimating the performance improvement, and use experimental data to validate the expected runtime savings. Finally, considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance GPU-accelerated Krylov subspace iterative methods.
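The data-transfer argument in the abstract can be illustrated with a minimal bandwidth-bound model (a sketch only; the vector counts and the fused grouping below are illustrative assumptions, not the paper's actual kernel decomposition). On a memory-bound GPU, a BLAS-1 kernel's runtime is roughly proportional to the bytes it moves, so fusing several vector updates into one application-specific kernel, keeping intermediate values in registers, reduces traffic and hence predicted runtime:

```python
# Sketch of a bandwidth-bound performance model: runtime of a BLAS-1
# kernel ~ bytes moved / memory bandwidth. Fusing chained vector updates
# into one kernel removes redundant loads/stores of intermediate vectors.

def traffic_separate(n, ops, bytes_per_elem=8):
    """Bytes moved when each vector operation runs as its own kernel.

    `ops` lists (vectors_read, vectors_written) per operation."""
    return sum((r + w) * n * bytes_per_elem for r, w in ops)

def traffic_fused(n, distinct_reads, distinct_writes, bytes_per_elem=8):
    """Bytes moved by one fused kernel that loads each distinct input
    vector once and stores each distinct output vector once."""
    return (distinct_reads + distinct_writes) * n * bytes_per_elem

n = 1_000_000
# Three chained AXPY-like updates, each reading 2 vectors and writing 1:
ops = [(2, 1)] * 3
sep = traffic_separate(n, ops)   # 9n elements of traffic
fused = traffic_fused(n, 4, 1)   # fused: 4 distinct inputs, 1 output -> 5n
print(f"predicted speedup from fusion: {sep / fused:.2f}x")  # 1.80x
```

The paper's actual model also accounts for the specific BiCGSTAB reformulation and the custom sparse matrix-vector product kernel; this sketch captures only the general traffic-reduction argument.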

Authors:
 Anzt, Hartwig [1]; Tomov, Stanimire [1]; Luszczek, Piotr [1]; Sawyer, William [2]; Dongarra, Jack [3]
  1. Univ. of Tennessee, Knoxville, TN (United States). Innovative Computing Lab.
  2. Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland)
  3. Univ. of Tennessee, Knoxville, TN (United States). Innovative Computing Lab.; Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Manchester (United Kingdom)
Publication Date: April 2015
Research Org.:
Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF); Russian Scientific Fund (Russian Federation)
Contributing Org.:
Swiss National Supercomputing Centre (CSCS), Lugano (Switzerland); Univ. of Manchester (United Kingdom)
OSTI Identifier:
1361293
Grant/Contract Number:  
SC0010042; ACI-1339822; N14-11-00190
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 29; Journal Issue: 3; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Krylov Subspace Methods; Iterative Solvers; Sparse Linear Systems; Graphics Processing Units; BiCGSTAB

Citation Formats

Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, and Dongarra, Jack. Acceleration of GPU-based Krylov solvers via data transfer reduction. United States: N. p., 2015. Web. doi:10.1177/1094342015580139.
Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, & Dongarra, Jack. Acceleration of GPU-based Krylov solvers via data transfer reduction. United States. doi:10.1177/1094342015580139.
Anzt, Hartwig, Tomov, Stanimire, Luszczek, Piotr, Sawyer, William, and Dongarra, Jack. 2015. "Acceleration of GPU-based Krylov solvers via data transfer reduction". United States. doi:10.1177/1094342015580139. https://www.osti.gov/servlets/purl/1361293.
@article{osti_1361293,
title = {Acceleration of GPU-based Krylov solvers via data transfer reduction},
author = {Anzt, Hartwig and Tomov, Stanimire and Luszczek, Piotr and Sawyer, William and Dongarra, Jack},
abstractNote = {Krylov subspace iterative solvers are often the method of choice when solving large sparse linear systems. At the same time, hardware accelerators such as graphics processing units (GPUs) continue to offer significant floating-point performance gains for matrix and vector computations through easy-to-use libraries of computational kernels. However, because these libraries usually provide a well-optimized but limited set of linear algebra operations, applications that use them often fail to avoid certain data communication, and hence fail to leverage the full potential of the accelerator. In this study, we target the acceleration of Krylov subspace iterative methods on GPUs, and in particular the Biconjugate Gradient Stabilized (BiCGSTAB) solver. We show that significant improvement can be achieved by reformulating the method to reduce data communication through application-specific kernels instead of the generic BLAS kernels provided, e.g., by NVIDIA’s cuBLAS library, and by designing a GPU-specific sparse matrix-vector product kernel that uses the GPU’s computing power more efficiently. Furthermore, we derive a model estimating the performance improvement, and use experimental data to validate the expected runtime savings. Finally, considering that the derived implementation achieves significantly higher performance, we assert that similar optimizations addressing algorithm structure, as well as the sparse matrix-vector product, are crucial for the subsequent development of high-performance GPU-accelerated Krylov subspace iterative methods.},
doi = {10.1177/1094342015580139},
journal = {International Journal of High Performance Computing Applications},
number = 3,
volume = 29,
place = {United States},
year = {2015},
month = {4}
}


Citation Metrics:
Cited by: 6 works (citation information provided by Web of Science)


Works referenced in this record:

Mixed Precision Iterative Refinement Techniques for the Solution of Dense Linear Systems
journal, November 2007

  • Buttari, Alfredo; Dongarra, Jack; Langou, Julie
  • The International Journal of High Performance Computing Applications, Vol. 21, Issue 4
  • DOI: 10.1177/1094342007084026

Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication
conference, May 2011

  • Buluç, Aydin; Williams, Samuel; Oliker, Leonid
  • 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2011.73

Model-driven autotuning of sparse matrix-vector multiply on GPUs
journal, May 2010


Improving the Performance of CA-GMRES on Multicores with Multiple GPUs
conference, May 2014

  • Yamazaki, Ichitaro; Anzt, Hartwig; Tomov, Stanimire
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.48

Model-driven autotuning of sparse matrix-vector multiply on GPUs
conference, January 2010

  • Choi, Jee W.; Singh, Amik; Vuduc, Richard W.
  • Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '10
  • DOI: 10.1145/1693453.1693471

Accelerating scientific computations with mixed precision algorithms
journal, December 2009

  • Baboulin, Marc; Buttari, Alfredo; Dongarra, Jack
  • Computer Physics Communications, Vol. 180, Issue 12
  • DOI: 10.1016/j.cpc.2008.11.005

Performance and Energy-Aware Characterization of the Sparse Matrix-Vector Multiplication on Multithreaded Architectures
conference, September 2014

  • Malossi, A. C. I.; Ineichen, Y.; Bekas, C.
  • 2014 43rd International Conference on Parallel Processing Workshops (ICPPW)
  • DOI: 10.1109/icppw.2014.30

A Fan-In Algorithm for Distributed Sparse Numerical Factorization
journal, May 1990

  • Ashcraft, Cleve; Eisenstat, Stanley C.; Liu, Joseph W. H.
  • SIAM Journal on Scientific and Statistical Computing, Vol. 11, Issue 3
  • DOI: 10.1137/0911033

GPU-accelerated preconditioned iterative linear solvers
journal, October 2012


Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems
journal, March 1992

  • van der Vorst, H. A.
  • SIAM Journal on Scientific and Statistical Computing, Vol. 13, Issue 2
  • DOI: 10.1137/0913035

Optimizing Krylov Subspace Solvers on Graphics Processing Units
conference, May 2014

  • Anzt, Hartwig; Sawyer, William; Tomov, Stanimire
  • 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/ipdpsw.2014.107

Reformulated Conjugate Gradient for the Energy-Aware Solution of Linear Systems on GPUs
conference, October 2013

  • Aliaga, Jose I.; Perez, Joaquin; Quintana-Orti, Enrique S.
  • 2013 42nd International Conference on Parallel Processing (ICPP)
  • DOI: 10.1109/ICPP.2013.41

Methods of conjugate gradients for solving linear systems
journal, December 1952

  • Hestenes, M. R.; Stiefel, E.
  • Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6
  • DOI: 10.6028/jres.049.044

Energy efficiency and performance frontiers for sparse computations on GPU supercomputers
conference, January 2015

  • Anzt, Hartwig; Tomov, Stanimire; Dongarra, Jack
  • Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '15
  • DOI: 10.1145/2712386.2712387

Works referencing / citing this record:

A review of CUDA optimization techniques and tools for structured grid computing
journal, July 2019