# Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs

## Abstract

Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.
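To make the quadratic-versus-linear claim concrete, the following is a minimal sketch of a simple memory-traffic model for the block update Y_j += sum_i A[i][j] * X_i over k vectors of length n. It is not QUDA's implementation: the function names `block_axpy_naive` and `block_axpy_fused`, the list-of-lists layout, and the traffic counter are all hypothetical and chosen for illustration. The fused variant assumes that the 2k values belonging to one lattice site fit in registers, which in practice would require tiling for large k.

```python
def block_axpy_naive(A, X, Y):
    """Y[j] += sum_i A[i][j] * X[i], issued as k*k independent axpy sweeps.

    Returns the number of vector elements moved to or from memory
    under the model: each axpy reads X[i], reads Y[j], and writes Y[j].
    """
    k, n = len(X), len(X[0])
    traffic = 0
    for i in range(k):
        for j in range(k):
            for s in range(n):
                Y[j][s] += A[i][j] * X[i][s]
            traffic += 3 * n  # one full sweep: 2 reads + 1 write per site
    return traffic


def block_axpy_fused(A, X, Y):
    """Same update, but each site of every vector is loaded and stored once.

    Assumption: the 2*k per-site values stay in registers while the
    k*k multiply-adds for that site are performed.
    """
    k, n = len(X), len(X[0])
    traffic = 0
    for s in range(n):
        x = [X[i][s] for i in range(k)]  # k loads
        y = [Y[j][s] for j in range(k)]  # k loads
        for j in range(k):
            for i in range(k):
                y[j] += A[i][j] * x[i]   # register-only arithmetic
        for j in range(k):
            Y[j][s] = y[j]               # k stores
        traffic += 3 * k
    return traffic


# Small integer-valued demo so both variants can be compared exactly.
k, n = 4, 8
A = [[i + j for j in range(k)] for i in range(k)]
X = [[i * n + s for s in range(n)] for i in range(k)]
Y_naive = [[0] * n for _ in range(k)]
Y_fused = [[0] * n for _ in range(k)]

naive_traffic = block_axpy_naive(A, X, Y_naive)  # 3 * n * k * k elements
fused_traffic = block_axpy_fused(A, X, Y_fused)  # 3 * n * k elements
print(naive_traffic, fused_traffic)
```

Both variants perform the same k*k*n multiply-adds and produce the same result; only the modeled memory traffic differs, by a factor of k. On a bandwidth-bound device this is the quantity that would be expected to dominate run time.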

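- Authors:
- Clark, M. A.; Strelchenko, Alexei; Vaquero, Alejandro; Wagner, Mathias; Weinberg, Evan

- Author Affiliations:
- NVIDIA Corp., Santa Clara
- Fermilab
- Utah U.
- Boston U.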

- Publication Date:
- 2017-10-26

- Research Org.:
- Fermi National Accelerator Lab. (FNAL), Batavia, IL (United States)

- Sponsoring Org.:
- USDOE Office of Science (SC), High Energy Physics (HEP) (SC-25)

- OSTI Identifier:
- 1418147

- Report Number(s):
- arXiv:1710.09745; FERMILAB-PUB-17-592-CD; 1632766

- DOE Contract Number:
- AC02-07CH11359

- Resource Type:
- Journal Article

- Resource Relation:
- Journal Name: TBD

- Country of Publication:
- United States

- Language:
- English

- Subject:
- 72 PHYSICS OF ELEMENTARY PARTICLES AND FIELDS

### Citation Formats

```
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs. United States: N. p., 2017. Web.
```

```
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, & Weinberg, Evan. Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs. United States.
```

```
Clark, M. A., Strelchenko, Alexei, Vaquero, Alejandro, Wagner, Mathias, and Weinberg, Evan. 2017. "Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs". United States. https://www.osti.gov/servlets/purl/1418147.
```

```
@article{osti_1418147,
  title = {Pushing Memory Bandwidth Limitations Through Efficient Implementations of Block-Krylov Space Solvers on GPUs},
  author = {Clark, M. A. and Strelchenko, Alexei and Vaquero, Alejandro and Wagner, Mathias and Weinberg, Evan},
  abstractNote = {Lattice quantum chromodynamics simulations in nuclear physics have benefited from a tremendous number of algorithmic advances such as multigrid and eigenvector deflation. These improve the time to solution but do not alleviate the intrinsic memory-bandwidth constraints of the matrix-vector operation dominating iterative solvers. Batching this operation for multiple vectors and exploiting cache and register blocking can yield a super-linear speedup. Block-Krylov solvers can naturally take advantage of such batched matrix-vector operations, further reducing the iterations to solution by sharing the Krylov space between solves. However, practical implementations typically suffer from the quadratic scaling in the number of vector-vector operations. Using the QUDA library, we present an implementation of a block-CG solver on NVIDIA GPUs which reduces the memory-bandwidth complexity of vector-vector operations from quadratic to linear. We present results for the HISQ discretization, showing a 5x speedup compared to highly optimized independent Krylov solves on NVIDIA's SaturnV cluster.},
  journal = {TBD},
  place = {United States},
  year = {2017},
  month = {10}
}
```