Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs
Abstract
Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.
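As a rough illustration of the ideas in the abstract, the following NumPy sketch implements the textbook block-CG recurrence (O'Leary, 1980) for a small dense symmetric positive-definite matrix. It is not the QUDA implementation and says nothing about GPU performance; the function name `block_cg`, its parameters, and the random test problem are all assumptions made for this example. It shows where the batched matrix-vector product appears (`A @ P`) and why the vector-vector work grows quadratically with the block size m: each Gram matrix such as `R.T @ R` holds m x m inner products.

```python
import numpy as np


def block_cg(A, B, tol=1e-10, max_iter=200):
    """Textbook block CG for SPD A (n x n) and a block of m right-hand sides B (n x m).

    Hypothetical helper for illustration only; returns (X, iterations used).
    """
    X = np.zeros_like(B)
    R = B.copy()              # residual block; equals B because X starts at zero
    P = R.copy()              # search-direction block
    rr = R.T @ R              # m x m Gram matrix: m^2 inner products at once
    b_norm = np.linalg.norm(B, axis=0)
    for it in range(1, max_iter + 1):
        Q = A @ P             # batched matrix-vector product for all m vectors
        alpha = np.linalg.solve(P.T @ Q, rr)
        X += P @ alpha        # block "axpy": every column mixes all m directions
        R -= Q @ alpha
        # Stop once every column meets the relative residual tolerance.
        if np.all(np.linalg.norm(R, axis=0) <= tol * b_norm):
            return X, it
        rr_new = R.T @ R
        beta = np.linalg.solve(rr, rr_new)
        P = R + P @ beta
        rr = rr_new
    return X, max_iter


# Small, well-conditioned SPD test problem (assumed data, not from the paper).
rng = np.random.default_rng(0)
n, m = 64, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
B = rng.standard_normal((n, m))

X, iters = block_cg(A, B)
print("iterations:", iters)
print("max residual:", np.abs(A @ X - B).max())
```

In this sketch the m x m products are written as single matrix-matrix multiplications, so each vector is read once per product rather than once per pair of vectors; that is only a loose analogy to the linear memory-bandwidth scaling the abstract describes.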
- Authors:
- Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro; Wagner, Mathias; Weinberg, Evan
- NVIDIA Corporation, Santa Clara, CA (United States)
- Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
- Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
- NVIDIA GmbH, Würselen (Germany)
- Boston Univ., MA (United States). Dept. of Physics
- Publication Date:
- July 2018
- Research Org.:
- Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), High Energy Physics (HEP); USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1418147
- Alternate Identifier(s):
- OSTI ID: 1734408
- Report Number(s):
- arXiv:1710.09745; FERMILAB-PUB-17-592-CD
- Journal ID: ISSN 0010-4655; 1632766
- Grant/Contract Number:
- AC02-07CH11359
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Computer Physics Communications
- Additional Journal Information:
- Journal Volume: 233; Journal Issue: C; Journal ID: ISSN 0010-4655
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING; Block solver; GPU
Citation Formats
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States: N. p., 2018. Web. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States. https://doi.org/10.1016/j.cpc.2018.06.019
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. "Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs". United States. https://doi.org/10.1016/j.cpc.2018.06.019. https://www.osti.gov/servlets/purl/1418147.
@article{osti_1418147,
title = {Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
doi = {10.1016/j.cpc.2018.06.019},
journal = {Computer Physics Communications},
number = {C},
volume = 233,
place = {United States},
year = {2018},
month = {7}
}