Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs
Abstract
Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
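As a rough illustration of the quadratic-versus-linear distinction drawn in the abstract, the sketch below computes the s x s matrix of inner products between two blocks of s vectors in two ways and counts how many vector elements are read. It is a toy traffic model under stated assumptions, not the QUDA implementation: the names `gram_naive` and `gram_fused` and the `loads` counter are hypothetical, and the counter simply assumes that a value held in a local variable costs no further memory traffic.

```python
# Toy model: "loads" counts vector-element reads from main memory.
# It is an illustrative assumption, not a measurement of any GPU kernel.

def gram_naive(X, Y):
    """Compute s*s independent dot products; each one re-reads both vectors."""
    s, n = len(X), len(X[0])
    loads = 0
    G = [[0.0] * s for _ in range(s)]
    for k in range(s):
        for l in range(s):
            for i in range(n):
                G[k][l] += X[k][i] * Y[l][i]
                loads += 2  # one element of X[k] and one element of Y[l]
    return G, loads


def gram_fused(X, Y):
    """Sweep the sites once; each element is read once and reused s times."""
    s, n = len(X), len(X[0])
    loads = 0
    G = [[0.0] * s for _ in range(s)]
    for i in range(n):
        xi = [X[k][i] for k in range(s)]  # s reads, then reused locally
        yi = [Y[l][i] for l in range(s)]  # s reads, then reused locally
        loads += 2 * s
        for k in range(s):
            for l in range(s):
                G[k][l] += xi[k] * yi[l]
    return G, loads


if __name__ == "__main__":
    s, n = 4, 8  # block size and vector length (arbitrary demo values)
    X = [[float(k + i) for i in range(n)] for k in range(s)]
    Y = [[float(k * i + 1) for i in range(n)] for k in range(s)]

    G_naive, naive_loads = gram_naive(X, Y)
    G_fused, fused_loads = gram_fused(X, Y)

    assert G_naive == G_fused        # same result either way
    print(naive_loads, fused_loads)  # 256 64
```

In this model the naive version reads 2·s²·n elements and the fused version reads 2·s·n, so the ratio equals the block size s. The arithmetic is the same s²·n multiply-adds in both cases; only the memory traffic changes.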
- Authors:
- Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro; Wagner, Mathias; Weinberg, Evan
- NVIDIA Corporation, Santa Clara, CA (United States)
- Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
- Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
- NVIDIA GmbH, Würselen (Germany)
- Boston Univ., MA (United States). Dept. of Physics
- Publication Date:
- July 2018
- Research Org.:
- Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), High Energy Physics (HEP); USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1418147
- Alternate Identifier(s):
- OSTI ID: 1734408
- Report Number(s):
- arXiv:1710.09745; FERMILAB-PUB-17-592-CD; Journal ID: ISSN 0010-4655; 1632766
- Grant/Contract Number:
- AC02-07CH11359
- Resource Type:
- Journal Article: Accepted Manuscript
- Journal Name:
- Computer Physics Communications
- Additional Journal Information:
- Journal Volume: 233; Journal Issue: C; Journal ID: ISSN 0010-4655
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING; Block solver; GPU
Citation Formats
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States: N. p., 2018. Web. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States. https://doi.org/10.1016/j.cpc.2018.06.019
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. 2018. "Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs". United States. https://doi.org/10.1016/j.cpc.2018.06.019. https://www.osti.gov/servlets/purl/1418147.
@article{osti_1418147,
title = {Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
doi = {10.1016/j.cpc.2018.06.019},
url = {https://www.osti.gov/biblio/1418147},
journal = {Computer Physics Communications},
issn = {0010-4655},
number = {C},
volume = 233,
place = {United States},
year = {2018},
month = {7}
}
Works referencing / citing this record:
Status and future perspectives for lattice gauge theory calculations to the exascale and beyond
journal, November 2019
- Joó, Bálint; Jung, Chulwoo; Christ, Norman H.
- The European Physical Journal A, Vol. 55, Issue 11