DOE PAGES title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors

Abstract

In this work, we consider the reformulation of hierarchical (\(\mathcal {H}\)) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\) matrices approximate specific dense matrices, e.g., those arising from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix–vector products. Parallelizing \(\mathcal {H}\) matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing \(\mathcal {H}\) matrix CPU implementations with many-core processors, we here aim to rely entirely on that processor type. As our main contribution, we introduce the parallel algorithmic patterns needed to map the full \(\mathcal {H}\) matrix construction and the fast matrix–vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal, and batching of linear algebra operations. The resulting model GPU implementation, hmglib, is, to the best of the author's knowledge, the first entirely GPU-based open-source \(\mathcal {H}\) matrix library of this kind. We investigate application examples as present in kernel ridge regression, Gaussian process regression, and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\) matrix library highlight profound speedups of our many-core parallel approach.
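The abstract names space-filling curves as one ingredient for mapping \(\mathcal {H}\) matrix construction to many-core hardware: points sorted along such a curve become spatially clustered in memory, so cluster trees can be built over contiguous index ranges. The sketch below is a hypothetical, CPU-side illustration of this idea using 2D Z-order (Morton) keys; it is not taken from hmglib (which is CUDA-based), and all names (`interleave_bits`, `morton_key`, `zorder_sort`) are illustrative only.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y to form a 2D Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y on odd positions
    return key

def morton_key(px: float, py: float, bits: int = 16) -> int:
    """Quantize a point in [0,1)^2 onto a 2^bits grid, then interleave."""
    scale = (1 << bits) - 1
    return interleave_bits(int(px * scale), int(py * scale), bits)

def zorder_sort(points):
    """Order points along the Z-order curve. On a GPU, the same pattern is a
    data-parallel key computation followed by a parallel (radix) sort."""
    return sorted(points, key=lambda p: morton_key(p[0], p[1]))
```

After the sort, points that are close in space tend to be close in the index ordering, so contiguous index blocks correspond to spatial clusters; this is what makes the subsequent tree construction and admissibility checks amenable to flat, data-parallel kernels.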

Authors:
Zaspel, Peter [1]
  1. Univ. Basel (Switzerland)
Publication Date:
September 2018
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); UT-Battelle LLC/ORNL, Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); Swiss National Science Foundation (SNF)
OSTI Identifier:
1565719
Grant/Contract Number:  
AC05-00OR22725; 407540_167186
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Scientific Computing
Additional Journal Information:
Journal Volume: 78; Journal Issue: 2; Journal ID: ISSN 0885-7474
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Hierarchical matrices; GPU; Batched linear algebra; Many-core parallelization; Space filling curves; Kernel ridge regression

Citation Formats

Zaspel, Peter. Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors. United States: N. p., 2018. Web. doi:10.1007/s10915-018-0809-4.
Zaspel, Peter. Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors. United States. doi:10.1007/s10915-018-0809-4.
Zaspel, Peter. 2018. "Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors". United States. doi:10.1007/s10915-018-0809-4. https://www.osti.gov/servlets/purl/1565719.
@article{osti_1565719,
title = {Algorithmic Patterns for $$\mathcal {H}$$-Matrices on Many-Core Processors},
author = {Zaspel, Peter},
abstractNote = {In this work, we consider the reformulation of hierarchical (\(\mathcal {H}\)) matrix algorithms for many-core processors, with a model implementation on graphics processing units (GPUs). \(\mathcal {H}\) matrices approximate specific dense matrices, e.g., those arising from discretized integral equations or kernel ridge regression, leading to log-linear time complexity in dense matrix–vector products. Parallelizing \(\mathcal {H}\) matrix operations on many-core processors is difficult due to the complex nature of the underlying algorithms. While previous algorithmic advances for many-core hardware focused on accelerating existing \(\mathcal {H}\) matrix CPU implementations with many-core processors, we here aim to rely entirely on that processor type. As our main contribution, we introduce the parallel algorithmic patterns needed to map the full \(\mathcal {H}\) matrix construction and the fast matrix–vector product to many-core hardware. Crucial ingredients are space-filling curves, parallel tree traversal, and batching of linear algebra operations. The resulting model GPU implementation, hmglib, is, to the best of the author's knowledge, the first entirely GPU-based open-source \(\mathcal {H}\) matrix library of this kind. We investigate application examples as present in kernel ridge regression, Gaussian process regression, and kernel-based interpolation. In this context, an in-depth performance analysis and a comparative performance study against a standard multi-core CPU \(\mathcal {H}\) matrix library highlight profound speedups of our many-core parallel approach.},
doi = {10.1007/s10915-018-0809-4},
journal = {Journal of Scientific Computing},
number = 2,
volume = 78,
place = {United States},
year = {2018},
month = {9}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 1 work
Citation information provided by
Web of Science

Works referenced in this record:

Fast BVH Construction on GPUs
journal, April 2009


Simpler and faster HLBVH with work queues
conference, January 2011

  • Garanzha, Kirill; Pantaleoni, Jacopo; McAllister, David
  • Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics - HPG '11
  • DOI: 10.1145/2018323.2018333

$$\mathcal {H}$$-LU factorization on many-core systems
journal, June 2013


ASKIT: An Efficient, Parallel Library for High-Dimensional Kernel Summations
journal, January 2016

  • March, William B.; Xiao, Bo; Yu, Chenhan D.
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1026468

Parallel Construction of Quadtrees and Quality Triangulations
journal, December 1999

  • Bern, Marshall; Eppstein, David; Teng, Shang-Hua
  • International Journal of Computational Geometry & Applications, Vol. 09, Issue 06
  • DOI: 10.1142/S0218195999000303

Introduction to hierarchical matrices with applications
journal, May 2003

  • Börm, Steffen; Grasedyck, Lars; Hackbusch, Wolfgang
  • Engineering Analysis with Boundary Elements, Vol. 27, Issue 5
  • DOI: 10.1016/S0955-7997(02)00152-2

Scalable GPU graph traversal
conference, January 2012

  • Merrill, Duane; Garland, Michael; Grimshaw, Andrew
  • Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12
  • DOI: 10.1145/2145816.2145832

Survey on the Technique of Hierarchical Matrices
journal, September 2015


Task-Based FMM for Multicore Architectures
journal, January 2014

  • Agullo, Emmanuel; Bramas, Bérenger; Coulaud, Olivier
  • SIAM Journal on Scientific Computing, Vol. 36, Issue 1
  • DOI: 10.1137/130915662

Batched QR and SVD algorithms on GPUs with applications in hierarchical matrix compression
journal, May 2018


A bridging model for parallel computation
journal, August 1990


Boost.Compute: A parallel computing library for C++ based on OpenCL
conference, January 2016


Adaptive Low-Rank Approximation of Collocation Matrices
journal, February 2003


A new version of the Fast Multipole Method for the Laplace equation in three dimensions
journal, January 1997


$$\mathcal {H}^2$$-matrices – Multilevel methods for the approximation of integral operators
journal, October 2004


Parallel $$\mathcal {H}$$-Matrix Arithmetics on Shared Memory Systems
journal, December 2004


Parallel black box $$\mathcal {H}$$ -LU preconditioning for elliptic boundary value problems
journal, April 2008

  • Grasedyck, Lars; Kriemann, Ronald; Le Borne, Sabine
  • Computing and Visualization in Science, Vol. 11, Issue 4-6
  • DOI: 10.1007/s00791-008-0098-9

On the fast matrix multiplication in the boundary element method by panel clustering
journal, July 1989

  • Hackbusch, W.; Nowak, Z. P.
  • Numerische Mathematik, Vol. 54, Issue 4
  • DOI: 10.1007/BF01396324

$$\mathcal {H}^2$$-matrix approximation of integral operators by interpolation
journal, October 2002


Recompression techniques for adaptive cross approximation
journal, September 2009


A Distributed-Memory Package for Dense Hierarchically Semi-Separable Matrix Computations Using Randomization
journal, June 2016

  • Rouet, François-Henry; Li, Xiaoye S.; Ghysels, Pieter
  • ACM Transactions on Mathematical Software, Vol. 42, Issue 4
  • DOI: 10.1145/2930660

Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
conference, January 2017

  • Abdelfattah, Ahmad; Haidar, Azzam; Tomov, Stanimire
  • Proceedings of the International Conference on Supercomputing - ICS '17
  • DOI: 10.1145/3079079.3079103

PetRBF — A parallel O(N) algorithm for radial basis function interpolation with Gaussians
journal, May 2010

  • Yokota, Rio; Barba, L. A.; Knepley, Matthew G.
  • Computer Methods in Applied Mechanics and Engineering, Vol. 199, Issue 25-28
  • DOI: 10.1016/j.cma.2010.02.008

An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling
journal, January 2016

  • Ghysels, Pieter; Li, Xiaoye S.; Rouet, François-Henry
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1010117