Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs

Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro; Wagner, Mathias; Weinberg, Evan

doi:10.1016/j.cpc.2018.06.019

Journal Article · July 2, 2018 · Computer Physics Communications

DOI: https://doi.org/10.1016/j.cpc.2018.06.019 · OSTI ID: 1418147

Clark, M. A. ^[1]; Strelchenko, Alexei ^[2]; Vaquero, Alejandro ^[3]; Wagner, Mathias ^[4]; Weinberg, Evan ^[5]

[1] NVIDIA Corporation, Santa Clara, CA (United States)
[2] Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
[3] Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
[4] NVIDIA GmbH, Würselen (Germany)
[5] Boston Univ., MA (United States). Dept. of Physics

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation that dominates iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.
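To make the structure described in the abstract concrete, the following is a minimal NumPy sketch of a textbook block-CG recurrence for a Hermitian positive-definite matrix. It is an illustration only and not the QUDA implementation: the function name `block_cg`, its parameters, and the dense random test matrix are assumptions introduced here. The sketch shows the single batched matrix-vector product per iteration (`A @ P`) and the small s x s Gram matrices whose construction and application cost grows quadratically with the number of right-hand sides s.

```python
import numpy as np


def block_cg(A, B, tol=1e-8, max_iter=200):
    """Textbook block-CG sketch for Hermitian positive-definite A and an n x s block B.

    Hypothetical helper for illustration; not QUDA's API or algorithmic variant.
    """
    X = np.zeros_like(B)
    R = B.copy()                       # residual block for the zero initial guess
    P = R.copy()                       # search-direction block
    rr = R.conj().T @ R                # s x s Gram matrix of the residuals
    b_norm = np.linalg.norm(B, axis=0)
    for it in range(1, max_iter + 1):
        AP = A @ P                     # one batched matrix-vector product for all s columns
        alpha = np.linalg.solve(P.conj().T @ AP, rr)
        X += P @ alpha                 # block updates: O(n s^2) work per iteration
        R -= AP @ alpha
        rr_new = R.conj().T @ R
        # Stop when every column's recursive residual meets the relative tolerance.
        if np.all(np.sqrt(np.diag(rr_new).real) <= tol * b_norm):
            return X, it
        beta = np.linalg.solve(rr, rr_new)
        P = R + P @ beta
        rr = rr_new
    return X, max_iter


# Usage on a small, well-conditioned symmetric positive-definite test problem.
rng = np.random.default_rng(0)
n, s = 200, 4
M = rng.standard_normal((n, n))
A = M @ M.T / n + np.eye(n)
B = rng.standard_normal((n, s))
X, iters = block_cg(A, B)
print(iters, np.linalg.norm(A @ X - B) / np.linalg.norm(B))
```

In this plain form the products `P @ alpha`, `AP @ alpha`, and `P @ beta` each touch all s columns for every one of the s outputs, which is the quadratic vector-vector cost the abstract refers to. The sketch also omits the safeguards a production solver needs, such as handling a rank-deficient residual block.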

Research Organization: Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)

Sponsoring Organization: USDOE Office of Science (SC), High Energy Physics (HEP); USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: AC02-07CH11359

OSTI ID: 1418147

Alternate ID(s): OSTI ID: 1734408

Report Number(s): arXiv:1710.09745; FERMILAB-PUB-17-592-CD; 1632766

Journal Information: Computer Physics Communications, Vol. 233, Issue C; ISSN 0010-4655

Publisher: Elsevier

Country of Publication: United States

Language: English

Citation Metrics: Cited by 5 works (citation information provided by Web of Science)

Cited By (1)

Status and future perspectives for lattice gauge theory calculations to the exascale and beyond Joó, Bálint; Jung, Chulwoo; Christ, Norman H. The European Physical Journal A, Vol. 55, Issue 11 https://doi.org/10.1140/epja/i2019-12919-7	journal	November 2019

Related Subjects

72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS
97 MATHEMATICS AND COMPUTING
Block solver
GPU
