DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs

Abstract

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
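To make the abstract's terms concrete, the following is a minimal NumPy sketch of a textbook block-CG iteration for a real symmetric positive definite matrix. It is an illustration under stated assumptions, not the QUDA implementation described in the paper: the function name `block_cg`, its parameters, and the dense real test matrix are hypothetical, and the lattice operator in practice is complex and sparse. The sketch shows the two ingredients the abstract names: a single batched matrix-vector product `A @ P` applied to all right-hand sides at once, and the small s-by-s Gram matrices whose entries account for the quadratic growth in vector-vector work.

```python
import numpy as np

def block_cg(A, B, tol=1e-8, max_iter=200):
    """Textbook block CG for A X = B, with A symmetric positive definite.

    B holds s right-hand sides as columns (float dtype). Illustrative sketch only.
    """
    X = np.zeros_like(B)
    R = B.copy()              # block residual, n x s
    P = R.copy()              # block search direction, n x s
    rr = R.T @ R              # s x s Gram matrix: s^2 inner products
    b_norms = np.linalg.norm(B, axis=0)
    for k in range(max_iter):
        # Stop once every column's residual norm is below tol * ||b||.
        if np.all(np.sqrt(np.diag(rr)) <= tol * b_norms):
            return X, k
        AP = A @ P                              # batched matrix-vector product
        alpha = np.linalg.solve(P.T @ AP, rr)   # small s x s solve
        X += P @ alpha                          # s^2 axpy-like updates
        R -= AP @ alpha
        rr_new = R.T @ R
        beta = np.linalg.solve(rr, rr_new)
        P = R + P @ beta
        rr = rr_new
    return X, max_iter

# Usage on a small, well-conditioned random system (assumed test data).
rng = np.random.default_rng(0)
n, s = 60, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
B = rng.standard_normal((n, s))
X, iters = block_cg(A, B)
```

Each Gram matrix such as `R.T @ R` has s² entries, but all of them can in principle be accumulated in one pass that reads each of the vectors once, which is the sense in which memory traffic can grow linearly in s even though the arithmetic grows quadratically. This sketch omits the safeguards against rank deficiency in the block residual that a production solver would need.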

Authors:
Clark, M. A. [1]; Strelchenko, Alexei [2]; Vaquero, Alejandro [3]; Wagner, Mathias [4]; Weinberg, Evan [5]
  1. NVIDIA Corporation, Santa Clara, CA (United States)
  2. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  3. Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
  4. NVIDIA GmbH, Würselen (Germany)
  5. Boston Univ., MA (United States). Dept. of Physics
Publication Date:
Research Org.:
Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1418147
Alternate Identifier(s):
OSTI ID: 1734408
Report Number(s):
arXiv:1710.09745; FERMILAB-PUB-17-592-CD
Journal ID: ISSN 0010-4655; 1632766
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Accepted Manuscript
Journal Name:
Computer Physics Communications
Additional Journal Information:
Journal Volume: 233; Journal Issue: C; Journal ID: ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING; Block solver; GPU

Citation Formats

Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States: N. p., 2018. Web. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States. https://doi.org/10.1016/j.cpc.2018.06.019
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. "Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs". United States. https://doi.org/10.1016/j.cpc.2018.06.019. https://www.osti.gov/servlets/purl/1418147.
@article{osti_1418147,
title = {Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
abstractNote = {Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.},
doi = {10.1016/j.cpc.2018.06.019},
journal = {Computer Physics Communications},
number = {C},
volume = 233,
place = {United States},
year = {2018},
month = {7}
}

Citation Metrics:
Cited by: 5 works
Citation information provided by
Web of Science

Works referenced in this record:

Variable Block CG Algorithms for Solving Large Sparse Symmetric Positive Definite Linear Systems on Parallel Computers, I: General Iterative Scheme
journal, October 1995

  • Nikishin, A. A.; Yeremin, A. Yu.
  • SIAM Journal on Matrix Analysis and Applications, Vol. 16, Issue 4
  • DOI: 10.1137/S0895479893247679

Solving lattice QCD systems of equations using mixed precision solvers on GPUs
journal, September 2010


Multiple right-hand side techniques for the numerical simulation of quasistatic electric and magnetic fields
journal, June 2008

  • Clemens, Markus; Helias, Moritz; Steinmetz, Thorsten
  • Journal of Computational and Applied Mathematics, Vol. 215, Issue 2
  • DOI: 10.1016/j.cam.2006.04.072

A review of block Krylov subspace methods for multisource electromagnetic modelling
journal, June 2015

  • Puzyrev, Vladimir; Cela, José María
  • Geophysical Journal International, Vol. 202, Issue 2
  • DOI: 10.1093/gji/ggv216

Block Krylov Recycling Algorithms for FETI-2LM Applied to 3-D Electromagnetic Wave Scattering and Radiation
journal, April 2017

  • Roux, Francois-Xavier; Barka, Andre
  • IEEE Transactions on Antennas and Propagation, Vol. 65, Issue 4
  • DOI: 10.1109/TAP.2017.2670541

Computing and Deflating Eigenvalues While Solving Multiple Right-Hand Side Linear Systems with an Application to Quantum Chromodynamics
journal, January 2010

  • Stathopoulos, Andreas; Orginos, Konstantinos
  • SIAM Journal on Scientific Computing, Vol. 32, Issue 1
  • DOI: 10.1137/080725532

Adaptive Multigrid Algorithm for Lattice QCD
journal, January 2008


Adaptive Multigrid Algorithm for the Lattice Wilson-Dirac Operator
journal, November 2010


Local coherence and deflation of the low quark modes in lattice QCD
journal, July 2007


Flexible Variants of Block Restarted GMRES Methods with Application to Geophysics
journal, January 2012

  • Calandra, Henri; Gratton, Serge; Langou, Julien
  • SIAM Journal on Scientific Computing, Vol. 34, Issue 2
  • DOI: 10.1137/10082364X

A breakdown-free block conjugate gradient method
journal, October 2016


Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals
journal, January 2000


Lattice QCD as a video game
journal, October 2007

  • Egri, Győző I.; Fodor, Zoltán; Hoelbling, Christian
  • Computer Physics Communications, Vol. 177, Issue 8
  • DOI: 10.1016/j.cpc.2007.06.005

Efficient Implementation of the Overlap Operator on Multi-GPUs
conference, July 2011

  • Alexandru, Andrei; Lujan, Michael; Pelissier, Craig
  • 2011 Symposium on Application Accelerators in High-Performance Computing (SAAHPC)
  • DOI: 10.1109/SAAHPC.2011.13

The Chroma Software System for Lattice QCD
journal, March 2005


A Framework for Lattice QCD Calculations on GPUs
conference, May 2014

  • Winter, F. T.; Clark, M. A.; Edwards, R. G.
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.112

The block conjugate gradient algorithm and related methods
journal, February 1980


Application of block Krylov subspace algorithms to the Wilson–Dirac equation with multiple right-hand sides in lattice QCD
journal, January 2010


Application of preconditioned block BiCGGR to the Wilson–Dirac equation with multiple right-hand sides in lattice QCD
journal, May 2010


Modified block BiCGSTAB for lattice QCD
journal, January 2012

  • Nakamura, Y.; Ishikawa, K. -I.; Kuramashi, Y.
  • Computer Physics Communications, Vol. 183, Issue 1
  • DOI: 10.1016/j.cpc.2011.08.010

A deflated conjugate gradient method for multiple right hand sides and multiple shifts
journal, November 2013


The QCD finite temperature transition and hybrid Monte Carlo
journal, February 1989


Hamiltonian formulation of Wilson's lattice gauge theories
journal, January 1975


Further Improvements to staggered quarks
journal, March 2004

  • Follana, Eduardo; Mason, Quentin; Davies, Christine
  • Nuclear Physics B - Proceedings Supplements, Vol. 129-130
  • DOI: 10.1016/S0920-5632(03)02610-0

Methods of conjugate gradients for solving linear systems
journal, December 1952

  • Hestenes, M. R.; Stiefel, E.
  • Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6
  • DOI: 10.6028/jres.049.044

Roundoff error analysis of the CholeskyQR2 algorithm in an oblique inner product
journal, January 2016

  • Yamamoto, Yusaku; Nakatsukasa, Yuji; Yanagisawa, Yuka
  • JSIAM Letters, Vol. 8, Issue 0
  • DOI: 10.14495/jsiaml.8.5

Reliable updated residuals in hybrid Bi-CG methods
journal, June 1996

  • Sleijpen, G. L. G.; van der Vorst, H. A.
  • Computing, Vol. 56, Issue 2
  • DOI: 10.1007/BF02309342

Effective noise reduction techniques for disconnected loops in Lattice QCD
journal, September 2010

  • Bali, Gunnar S.; Collins, Sara; Schäfer, Andreas
  • Computer Physics Communications, Vol. 181, Issue 9
  • DOI: 10.1016/j.cpc.2010.05.008

Block s-step Krylov iterative methods
journal, January 2010

  • Chronopoulos, Anthony T.; Kucherov, Andrey B.
  • Numerical Linear Algebra with Applications, Vol. 17, Issue 1
  • DOI: 10.1002/nla.643

Amesos2 and Belos: Direct and Iterative Solvers for Large Sparse Linear Systems
journal, January 2012

  • Bavier, Eric; Hoemmen, Mark; Rajamanickam, Sivasankaran
  • Scientific Programming, Vol. 20, Issue 3
  • DOI: 10.1155/2012/243875

An Implementation of Block Conjugate Gradient Algorithm on CPU-GPU Processors
text, January 2014


Conjugate gradient solvers on Intel Xeon Phi and NVIDIA GPUs.
text, January 2015


Works referencing / citing this record:

Status and future perspectives for lattice gauge theory calculations to the exascale and beyond
journal, November 2019

  • Joó, Bálint; Jung, Chulwoo; Christ, Norman H.
  • The European Physical Journal A, Vol. 55, Issue 11
  • DOI: 10.1140/epja/i2019-12919-7