Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs
Abstract
Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a superlinear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.
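The quadratic-to-linear claim for vector-vector operations can be illustrated with a small memory-traffic model. The sketch below is not QUDA code and does not reproduce the authors' kernels. It assumes, for illustration only, that the dominant block operation is the N x N matrix of inner products between two sets of N vectors, and that a fused kernel can hold the N + N elements of one lattice site in registers or cache. All function names are hypothetical.

```python
# Illustrative traffic model only; not QUDA code. All names are hypothetical.

def gram_naive(X, Y):
    """Compute G[i][j] = <X[i], Y[j]> as N*N independent dot products.

    Each dot product streams both vectors from memory again, so the
    number of element loads grows quadratically with the vector count N.
    """
    n, loads = len(X), 0
    G = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0j
            for x, y in zip(X[i], Y[j]):
                acc += x.conjugate() * y
                loads += 2  # one element of X[i], one element of Y[j]
            G[i][j] = acc
    return G, loads


def gram_fused(X, Y):
    """Compute the same matrix in a single pass over the sites.

    At each site the N elements of X and the N elements of Y are loaded
    once (standing in for registers or cache) and reused for all N*N
    products, so element loads grow only linearly with N.
    """
    n, volume, loads = len(X), len(X[0]), 0
    G = [[0j] * n for _ in range(n)]
    for s in range(volume):
        xs = [X[i][s].conjugate() for i in range(n)]
        ys = [Y[j][s] for j in range(n)]
        loads += 2 * n  # each vector element at this site is read once
        for i in range(n):
            for j in range(n):
                G[i][j] += xs[i] * ys[j]
    return G, loads


def demo(n=4, volume=8):
    """Build small test vectors, check both results agree, return load counts."""
    X = [[complex(i + s, i - s) for s in range(volume)] for i in range(n)]
    Y = [[complex(s - j, 2 * j + s) for s in range(volume)] for j in range(n)]
    G1, loads_naive = gram_naive(X, Y)
    G2, loads_fused = gram_fused(X, Y)
    assert all(abs(G1[i][j] - G2[i][j]) < 1e-9
               for i in range(n) for j in range(n))
    return loads_naive, loads_fused
```

In this model the naive version performs 2·N²·V element loads for N vectors of length V, while the fused version performs 2·N·V, so the traffic ratio equals N. The arithmetic is still N²·V multiply-adds in both cases, which is why the fused form moves the operation away from the memory-bandwidth limit.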
 Authors:
 Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro; Wagner, Mathias; Weinberg, Evan
 NVIDIA Corporation, Santa Clara, CA (United States)
 Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
 Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
 NVIDIA GmbH, Würselen (Germany)
 Boston Univ., MA (United States). Dept. of Physics
 Publication Date:
 Research Org.:
 Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), High Energy Physics (HEP); USDOE National Nuclear Security Administration (NNSA)
 OSTI Identifier:
 1418147
 Alternate Identifier(s):
 OSTI ID: 1734408
 Report Number(s):
 arXiv:1710.09745; FERMILAB-PUB-17-592-CD
Journal ID: ISSN 0010-4655; 1632766
 Grant/Contract Number:
 AC02-07CH11359
 Resource Type:
 Accepted Manuscript
 Journal Name:
 Computer Physics Communications
 Additional Journal Information:
 Journal Volume: 233; Journal Issue: C; Journal ID: ISSN 0010-4655
 Publisher:
 Elsevier
 Country of Publication:
 United States
 Language:
 English
 Subject:
 72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING; Block solver; GPU
Citation Formats
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States: N. p., 2018.
Web. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States. https://doi.org/10.1016/j.cpc.2018.06.019
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Mon .
"Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs". United States. https://doi.org/10.1016/j.cpc.2018.06.019. https://www.osti.gov/servlets/purl/1418147.
@article{osti_1418147,
title = {Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
abstractNote = {Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a superlinear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.},
doi = {10.1016/j.cpc.2018.06.019},
journal = {Computer Physics Communications},
number = {C},
volume = 233,
place = {United States},
year = {2018},
month = {7}
}