OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs

Abstract

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.
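
The quadratic-to-linear claim about vector-vector operations can be illustrated with a small sketch. It is not taken from the paper or from QUDA: the function names `gram_pairwise` and `gram_fused`, the `loads` counter, and the use of real-valued Python lists in place of complex lattice fields on a GPU are all assumptions made for illustration. The sketch computes the n x n matrix of inner products between two blocks of n vectors in two ways. The first calls an independent dot product for each pair, so every call streams two full vectors from memory. The second makes a single pass over the vector elements and reuses each loaded element n times, which is the idea behind register and cache blocking.

```python
# Illustrative sketch only: plain Python standing in for GPU kernels, not QUDA code.
# "loads" is a hypothetical counter of full-vector reads from main memory.

def gram_pairwise(X, Y):
    """n*n independent dot products; each one re-streams two full vectors."""
    n = len(X)
    loads = 0
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            G[i][j] = sum(x * y for x, y in zip(X[i], Y[j]))
            loads += 2  # X[i] and Y[j] are both read in full for this pair
    return G, loads


def gram_fused(X, Y):
    """One pass over the elements; each loaded value is reused n times."""
    n = len(X)
    length = len(X[0])
    G = [[0.0] * n for _ in range(n)]
    for s in range(length):
        xs = [X[i][s] for i in range(n)]  # one element of every X vector
        ys = [Y[j][s] for j in range(n)]  # one element of every Y vector
        for i in range(n):
            for j in range(n):
                G[i][j] += xs[i] * ys[j]  # n*n updates from 2n loaded values
    loads = 2 * n  # each of the 2n vectors is streamed exactly once
    return G, loads


if __name__ == "__main__":
    X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    Y = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    print(gram_pairwise(X, Y))
    print(gram_fused(X, Y))
```

Both functions perform the same n^2 multiply-accumulates per element and return the same matrix. Only the counted memory traffic differs: 2n^2 vector reads for the pairwise version against 2n for the fused one. On a real GPU the reuse is bounded by register and cache capacity, so this counting model is an idealization.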

Authors:
Clark, M. A. [1]; Strelchenko, Alexei [2]; Vaquero, Alejandro [3]; Wagner, Mathias [4]; Weinberg, Evan [5]
  1. NVIDIA Corporation, Santa Clara, CA (United States)
  2. Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
  3. Univ. of Utah, Salt Lake City, UT (United States). Dept. of Physics and Astronomy
  4. NVIDIA GmbH, Würselen (Germany)
  5. Boston Univ., MA (United States). Dept. of Physics
Publication Date:
2018-07-02
Research Org.:
Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)
Sponsoring Org.:
USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)
OSTI Identifier:
1418147
Report Number(s):
arXiv:1710.09745; FERMILAB-PUB-17-592-CD
Journal ID: ISSN 0010-4655; 1632766
Grant/Contract Number:  
AC02-07CH11359
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Computer Physics Communications
Additional Journal Information:
Journal Volume: 233; Journal Issue: C; Journal ID: ISSN 0010-4655
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS; 97 MATHEMATICS AND COMPUTING; Block solver; GPU

Citation Formats

Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States: N. p., 2018. Web. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs. United States. doi:10.1016/j.cpc.2018.06.019.
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. 2018. "Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs". United States. doi:10.1016/j.cpc.2018.06.019.
@article{osti_1418147,
title = {Pushing memory bandwidth limitations through efficient implementations of Block-Krylov space solvers on GPUs},
author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
abstractNote = {Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Here, using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.},
doi = {10.1016/j.cpc.2018.06.019},
journal = {Computer Physics Communications},
number = {C},
volume = 233,
place = {United States},
year = {2018},
month = jul
}

Journal Article:
Free Publicly Available Full Text
This content will become publicly available on July 2, 2019
Publisher's Version of Record
