OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

Abstract

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.
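
The two cost drivers named in the abstract, the batched matrix-vector product and the quadratically many vector-vector operations, can be seen in a textbook block-CG iteration. The NumPy sketch below is an illustration only and is not the QUDA implementation. It assumes a small dense real symmetric positive-definite matrix standing in for the lattice operator, and the function name `block_cg` and all sizes are hypothetical.

```python
import numpy as np


def block_cg(A, B, tol=1e-8, max_iter=200):
    """Textbook block CG for SPD A and a block of k right-hand sides B (n x k)."""
    X = np.zeros_like(B)
    R = B.copy()                  # residual block (X starts at zero)
    P = R.copy()                  # search-direction block
    rr = R.T @ R                  # k x k Gram matrix: k*k inner products
    b_norm = np.linalg.norm(B, axis=0)
    for it in range(max_iter):
        Q = A @ P                             # one batched matrix-vector product
        alpha = np.linalg.solve(P.T @ Q, rr)  # small k x k solve
        X += P @ alpha                        # block update: k*k vector updates
        R -= Q @ alpha
        if np.all(np.linalg.norm(R, axis=0) <= tol * b_norm):
            return X, it + 1
        rr_new = R.T @ R
        beta = np.linalg.solve(rr, rr_new)
        P = R + P @ beta
        rr = rr_new
    return X, max_iter


# Hypothetical test problem: well-conditioned SPD matrix, k = 3 right-hand sides.
rng = np.random.default_rng(0)
n, k = 50, 3
M = rng.standard_normal((n, n))
A = M.T @ M + n * np.eye(n)
B = rng.standard_normal((n, k))

X, iters = block_cg(A, B)
```

Each iteration applies `A` once to all `k` vectors, while the products `P.T @ Q`, `R.T @ R`, `P @ alpha` and `P @ beta` each touch on the order of `k*k` vector pairs. That quadratic term is the one the abstract says the paper's implementation reduces to linear memory-bandwidth cost.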

Authors:
Clark, M. A. [1]; Strelchenko, Alexei [2]; Vaquero, Alejandro [3]; Wagner, Mathias [1]; Weinberg, Evan [4]
  1. NVIDIA Corp., Santa Clara
  2. Fermilab
  3. Utah U.
  4. Boston U.
Publication Date:
2017-10-26
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1418147
Report Number(s):
arXiv:1710.09745; FERMILAB-PUB-17-592-CD
1632766
DOE Contract Number:
AC02-07CH11359
Resource Type:
Journal Article
Resource Relation:
Journal Name: TBD
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS

Citation Formats

Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs. United States: N. p., 2017. Web.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. (2017). Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs. United States.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. 2017. "Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs". United States. https://www.osti.gov/servlets/purl/1418147.
@article{osti_1418147,
title = {Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
abstractNote = {Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speed up. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly-optimized independent Krylov solves on NVIDIA's SaturnV cluster.},
journal = {TBD},
place = {United States},
year = {2017},
month = {10}
}