Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs
Abstract
Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
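To make the quadratic cost mentioned in the abstract concrete, the following is a minimal NumPy sketch of a textbook block conjugate gradient iteration for a symmetric positive definite matrix and s right-hand sides. It is an illustration under stated assumptions, not the QUDA implementation described in the paper: the function name `block_cg` and its parameters are hypothetical, the matrix is a small dense stand-in for the lattice operator, and no breakdown handling (for example, for linearly dependent residuals) is included. The s x s Gram matrices and the block updates are the vector-vector operations whose count grows as s squared.

```python
import numpy as np

def block_cg(A, B, tol=1e-8, max_iter=500):
    """Textbook block CG sketch for SPD A (n x n) and right-hand sides B (n x s).

    Hypothetical helper for illustration only; not the QUDA API.
    Returns the solution block X and the number of iterations used.
    """
    X = np.zeros_like(B, dtype=float)
    R = B - A @ X                      # block residual, one column per system
    P = R.copy()                       # block search directions
    RtR = R.T @ R                      # s x s Gram matrix: s^2 inner products
    b_norms = np.linalg.norm(B, axis=0)
    for it in range(1, max_iter + 1):
        AP = A @ P                     # batched matrix-vector product
        alpha = np.linalg.solve(P.T @ AP, RtR)   # small s x s solve
        X += P @ alpha                 # s^2 axpy-like updates
        R -= AP @ alpha
        RtR_new = R.T @ R
        # Stop when every column meets the relative residual tolerance.
        if np.all(np.sqrt(np.diag(RtR_new)) <= tol * b_norms):
            return X, it
        beta = np.linalg.solve(RtR, RtR_new)
        P = R + P @ beta               # another s^2 block update
        RtR = RtR_new
    return X, max_iter
```

In this sketch the matrix-vector work scales linearly with the number of right-hand sides s, while the products `P.T @ AP`, `R.T @ R`, `P @ alpha`, and `P @ beta` each involve on the order of s squared vector-length operations, which is the scaling the abstract identifies as the practical bottleneck.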
 Authors:
 Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro; Wagner, Mathias; Weinberg, Evan
 NVIDIA Corporation, Santa Clara, CA (United States)
 Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
 Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
 NVIDIA GmbH, Würselen (Germany)
 Boston Univ., MA (United States). Dept. of Physics
 Publication Date:
 Research Org.:
 Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
 Sponsoring Org.:
 USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
 OSTI Identifier:
 1418147
 Report Number(s):
 arXiv:1710.09745; FERMILAB-PUB-17-592-CD
Journal ID: ISSN 0010-4655; 1632766
 Grant/Contract Number:
 AC02-07CH11359
 Resource Type:
 Accepted Manuscript
 Journal Name:
 Computer Physics Communications
 Additional Journal Information:
 Journal Volume: 233; Journal Issue: C; Journal ID: ISSN 0010-4655
 Publisher:
 Elsevier
 Country of Publication:
 United States
 Language:
 English
 Subject:
 72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING; Block solver; GPU
Citation Formats
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States: N. p., 2018.
Web. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Mon .
"Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs". United States. doi:10.1016/j.cpc.2018.06.019. https://www.osti.gov/servlets/purl/1418147.
@article{osti_1418147,
title = {Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
abstractNote = {Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.},
doi = {10.1016/j.cpc.2018.06.019},
journal = {Computer Physics Communications},
number = {C},
volume = 233,
place = {United States},
year = {2018},
month = {7}
}