# Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors

## Abstract

In this work, we consider the reformulation of hierarchical (\(\mathcal {H}\)) matrix algorithms for many-core processors with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\) matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix–vector products. The parallelization of \(\mathcal {H}\) matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on *accelerating* existing \(\mathcal {H}\) matrix CPU implementations by many-core processors, we here aim at totally relying on that processor type. As main contribution, we introduce the necessary parallel algorithmic patterns allowing to map the full \(\mathcal {H}\) matrix construction and the fast matrix–vector product to many-core hardware. In this work, crucial ingredients are space filling curves, parallel tree traversal and batching of linear algebra operations. The resulting model GPU implementation hmglib is, to the best of the author's knowledge, the first entirely GPU-based Open Source \(\mathcal {H}\) matrix library of this kind. We investigate application examples as present in kernel ridge regression, Gaussian Process Regression and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\) matrix library highlight profound speedups of our many-core parallel approach.
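One of the ingredients named in the abstract, space filling curves, can be illustrated with a Z-order (Morton) curve, which maps points to integer keys so that a plain sort places spatially close points next to each other. The following is a minimal CPU sketch in Python, assuming 2D points in the unit square and 16 bits per coordinate; the function names `spread_bits`, `morton2d` and `z_order` are hypothetical and are not taken from hmglib, whose actual curve and data layout are not described in this record.

```python
BITS = 16  # bits per coordinate (an assumption made for this sketch)


def spread_bits(v: int) -> int:
    """Insert a zero bit between each of the lower 16 bits of v."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton2d(x: float, y: float) -> int:
    """Z-order key of a point in [0, 1)^2: x bits at even, y bits at odd positions."""
    scale = 1 << BITS
    ix = min(max(int(x * scale), 0), scale - 1)  # quantize and clamp to the grid
    iy = min(max(int(y * scale), 0), scale - 1)
    return spread_bits(ix) | (spread_bits(iy) << 1)


def z_order(points):
    """Return the indices of the points sorted along the Z-order curve."""
    return sorted(range(len(points)), key=lambda i: morton2d(*points[i]))


if __name__ == "__main__":
    pts = [(0.75, 0.75), (0.25, 0.75), (0.75, 0.25), (0.25, 0.25)]
    print(z_order(pts))
```

After such a sort, contiguous index ranges correspond to spatial clusters, which is what makes a sort-based tree construction attractive on many-core hardware: the key computation is independent per point and the sort is a standard parallel primitive.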

- Authors:

- Univ. Basel (Switzerland)

- Publication Date:

- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)

- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); Swiss National Science Foundation (SNF)

- OSTI Identifier:
- 1565719

- Grant/Contract Number:
- AC05-00OR22725; 407540_167186

- Resource Type:
- Accepted Manuscript

- Journal Name:
- Journal of Scientific Computing

- Additional Journal Information:
- Journal Volume: 78; Journal Issue: 2; Journal ID: ISSN 0885-7474

- Publisher:
- Springer

- Country of Publication:
- United States

- Language:
- English

- Subject:
- 97 MATHEMATICS AND COMPUTING; Hierarchical matrices; GPU; Batched linear algebra; Many-core parallelization; Space filling curves; Kernel ridge regression

### Citation Formats

```
Zaspel, Peter. Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors. United States: N. p., 2018. Web. doi:10.1007/s10915-018-0809-4.
```

```
Zaspel, Peter. Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors. United States. doi:10.1007/s10915-018-0809-4.
```

```
Zaspel, Peter. Sat .
"Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors". United States. doi:10.1007/s10915-018-0809-4. https://www.osti.gov/servlets/purl/1565719.
```

```
@article{osti_1565719,
  title = {Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors},
  author = {Zaspel, Peter},
  abstractNote = {In this work, we consider the reformulation of hierarchical (\(\mathcal {H}\)) matrix algorithms for many-core processors with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\) matrices approximate specific dense matrices, e.g., from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix–vector products. The parallelization of \(\mathcal {H}\) matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing \(\mathcal {H}\) matrix CPU implementations by many-core processors, we here aim at totally relying on that processor type. As main contribution, we introduce the necessary parallel algorithmic patterns allowing to map the full \(\mathcal {H}\) matrix construction and the fast matrix–vector product to many-core hardware. In this work, crucial ingredients are space filling curves, parallel tree traversal and batching of linear algebra operations. The resulting model GPU implementation hmglib is the, to the best of the authors knowledge, first entirely GPU-based Open Source \(\mathcal {H}\) matrix library of this kind. We investigate application examples as present in kernel ridge regression, Gaussian Process Regression and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\) matrix library highlights profound speedups of our many-core parallel approach.},
  doi = {10.1007/s10915-018-0809-4},
  journal = {Journal of Scientific Computing},
  number = 2,
  volume = 78,
  place = {United States},
  year = {2018},
  month = {9}
}
```

*Citation information provided by*

Web of Science

Works referenced in this record:

##
Fast BVH Construction on GPUs

journal, April 2009

- Lauterbach, C.; Garland, M.; Sengupta, S.
- Computer Graphics Forum, Vol. 28, Issue 2

##
Simpler and faster HLBVH with work queues

conference, January 2011

- Garanzha, Kirill; Pantaleoni, Jacopo; McAllister, David
- Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics - HPG '11

##
$$\mathcal {H}$$-LU factorization on many-core systems

journal, June 2013

- Kriemann, Ronald
- Computing and Visualization in Science, Vol. 16, Issue 3

##
ASKIT: An Efficient, Parallel Library for High-Dimensional Kernel Summations

journal, January 2016

- March, William B.; Xiao, Bo; Yu, Chenhan D.
- SIAM Journal on Scientific Computing, Vol. 38, Issue 5

##
Parallel Construction of Quadtrees and Quality Triangulations

journal, December 1999

- Bern, Marshall; Eppstein, David; Teng, Shang-Hua
- International Journal of Computational Geometry & Applications, Vol. 09, Issue 06

##
Introduction to hierarchical matrices with applications

journal, May 2003

- Börm, Steffen; Grasedyck, Lars; Hackbusch, Wolfgang
- Engineering Analysis with Boundary Elements, Vol. 27, Issue 5

##
Scalable GPU graph traversal

conference, January 2012

- Merrill, Duane; Garland, Michael; Grimshaw, Andrew
- Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12

##
Survey on the Technique of Hierarchical Matrices

journal, September 2015

- Hackbusch, Wolfgang
- Vietnam Journal of Mathematics, Vol. 44, Issue 1

##
Task-Based FMM for Multicore Architectures

journal, January 2014

- Agullo, Emmanuel; Bramas, Bérenger; Coulaud, Olivier
- SIAM Journal on Scientific Computing, Vol. 36, Issue 1

##
Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression

journal, May 2018

- Boukaram, Wajih Halim; Turkiyyah, George; Ltaief, Hatem
- Parallel Computing, Vol. 74

##
A bridging model for parallel computation

journal, August 1990

- Valiant, Leslie G.
- Communications of the ACM, Vol. 33, Issue 8

##
Boost.Compute: A parallel computing library for C++ based on OpenCL

conference, January 2016

- Szuppe, Jakub
- Proceedings of the 4th International Workshop on OpenCL - IWOCL '16

##
Adaptive Low-Rank Approximation of Collocation Matrices

journal, February 2003

- Bebendorf, M.; Rjasanow, S.
- Computing, Vol. 70, Issue 1

##
A new version of the Fast Multipole Method for the Laplace equation in three dimensions

journal, January 1997

- Greengard, Leslie; Rokhlin, Vladimir
- Acta Numerica, Vol. 6

##
FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method

journal, July 2013

- Yokota, Rio; Barba, L. A.
- Computers & Fluids, Vol. 80

##
$$\mathcal {H}^2$$-matrices – Multilevel methods for the approximation of integral operators

journal, October 2004

- Börm, Steffen
- Computing and Visualization in Science, Vol. 7, Issue 3-4

##
Parallel $$\mathcal {H}$$-Matrix Arithmetics on Shared Memory Systems

journal, December 2004

- Kriemann, R.
- Computing, Vol. 74, Issue 3

##
Parallel black box $$\mathcal {H}$$-LU preconditioning for elliptic boundary value problems

journal, April 2008

- Grasedyck, Lars; Kriemann, Ronald; Le Borne, Sabine
- Computing and Visualization in Science, Vol. 11, Issue 4-6

##
On the fast matrix multiplication in the boundary element method by panel clustering

journal, July 1989

- Hackbusch, W.; Nowak, Z. P.
- Numerische Mathematik, Vol. 54, Issue 4

##
$$\mathcal {H}^2$$-matrix approximation of integral operators by interpolation

journal, October 2002

- Hackbusch, Wolfgang; Börm, Steffen
- Applied Numerical Mathematics, Vol. 43, Issue 1-2

##
Recompression techniques for adaptive cross approximation

journal, September 2009

- Bebendorf, M.; Kunis, S.
- Journal of Integral Equations and Applications, Vol. 21, Issue 3

##
A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization

journal, June 2016

- Rouet, François-Henry; Li, Xiaoye S.; Ghysels, Pieter
- ACM Transactions on Mathematical Software, Vol. 42, Issue 4

##
Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

conference, January 2017

- Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire
- Proceedings of the International Conference on Supercomputing - ICS '17

##
PetRBF — A parallel O(N) algorithm for radial basis function interpolation with Gaussians

journal, May 2010

- Yokota, Rio; Barba, L. A.; Knepley, Matthew G.
- Computer Methods in Applied Mechanics and Engineering, Vol. 199, Issue 25-28

##
An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling

journal, January 2016

- Ghysels, Pieter; Li, Xiaoye S.; Rouet, François-Henry
- SIAM Journal on Scientific Computing, Vol. 38, Issue 5