DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A High-Throughput Solver for Marginalized Graph Kernels on GPU

Abstract

Here, we present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.
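The idea of forming the graph tensor product on-the-fly inside a conjugate gradient solve can be illustrated with a minimal CPU sketch in plain Python. It is not the authors' GPU implementation. It assumes a simplified system (D - A ⊗ B) x = b with a diagonal matrix D chosen so that the system is symmetric positive definite; the abstract does not state the exact form of the generalized Laplacian, and any vertex or edge label terms are omitted here. The function names (kron_matvec, apply_system, conjugate_gradient) are hypothetical.

```python
# Sketch only: matrix-free conjugate gradient for (diag(d) - A kron B) x = b.
# The Kronecker (tensor) product of the two adjacency matrices is never stored;
# its action on a vector is recomputed from A and B on every iteration.
# All names and the simplified system are illustrative assumptions.


def kron_matvec(A, B, x):
    """Return (A kron B) @ x without materializing the (n*m) x (n*m) product.

    Row p = i*m + k and column q = j*m + l of the product hold A[i][j] * B[k][l].
    """
    n, m = len(A), len(B)
    y = [0.0] * (n * m)
    for i in range(n):
        for j in range(n):
            a = A[i][j]
            if a == 0.0:
                continue  # skip non-edges of the first graph
            for k in range(m):
                for l in range(m):
                    b_kl = B[k][l]
                    if b_kl != 0.0:
                        y[i * m + k] += a * b_kl * x[j * m + l]
    return y


def apply_system(A, B, d, x):
    """Apply the assumed system matrix diag(d) - (A kron B) to x."""
    Kx = kron_matvec(A, B, x)
    return [d[p] * x[p] - Kx[p] for p in range(len(x))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, B, d, b, tol=1e-12, max_iter=1000):
    """Solve (diag(d) - A kron B) x = b; tol bounds the squared residual norm."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - M x for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Mp = apply_system(A, B, d, p)
        alpha = rs / dot(p, Mp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Usage: a 3-node path graph and a 2-node single-edge graph.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
d = [3.0] * 6  # larger than any row sum of A kron B, so the system is SPD
b = [1.0] * 6
x = conjugate_gradient(A, B, d, b)
print(x)
```

The sketch trades arithmetic for memory: the product matrix has (n*m)^2 entries, while the two input graphs need only n^2 + m^2, which is the same trade the abstract describes for GPUs whose instruction throughput far exceeds their memory bandwidth.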

Authors:
 Tang, Yu-Hang [1]; Selvitopi, Oguz [1]; Popovici, Doru Thom [1]; Buluc, Aydin [1]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Publication Date:
2020-05-01
Research Org.:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1582329
Grant/Contract Number:  
AC02-05CH11231; AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Additional Journal Information:
Journal Name: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2020; Related Information: see also on arXiv abs/1910.06310
Publisher:
IEEE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; kernel; symmetric matrices; linear systems; mathematical model; tensile stress; task analysis; graphics processing units

Citation Formats

Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, and Buluc, Aydin. A High-Throughput Solver for Marginalized Graph Kernels on GPU. United States: N. p., 2020. Web. doi:10.1109/ipdps47924.2020.00080.
Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, & Buluc, Aydin. A High-Throughput Solver for Marginalized Graph Kernels on GPU. United States. https://doi.org/10.1109/ipdps47924.2020.00080
Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, and Buluc, Aydin. 2020. "A High-Throughput Solver for Marginalized Graph Kernels on GPU". United States. https://doi.org/10.1109/ipdps47924.2020.00080. https://www.osti.gov/servlets/purl/1582329.
@article{osti_1582329,
title = {A High-Throughput Solver for Marginalized Graph Kernels on GPU},
author = {Tang, Yu-Hang and Selvitopi, Oguz and Popovici, Doru Thom and Buluc, Aydin},
abstractNote = {Here, we present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.},
doi = {10.1109/ipdps47924.2020.00080},
journal = {2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
volume = {2020},
place = {United States},
year = {2020},
month = {5}
}

Works referenced in this record:

The Protein Data Bank
journal, January 2000


A linear-time heuristic for improving network partitions
conference, January 1988

  • Fiduccia, C. M.; Mattheyses, R. M.
  • Papers on Twenty-five years of electronic design automation - 25 years of DAC
  • DOI: 10.1145/62882.62910

Improving performance of sparse matrix-vector multiplication
conference, January 1999

  • Pinar, Ali; Heath, Michael T.
  • Proceedings of the 1999 ACM/IEEE conference on Supercomputing
  • DOI: 10.1145/331532.331562

Accelerating dissipative particle dynamics simulations on GPUs: Algorithms, numerics and applications
journal, November 2014


Think Locally, Act Globally: Highly Balanced Graph Partitioning
book, January 2013


An effective multilevel tabu search approach for balanced graph partitioning
journal, July 2011


DrugBank 5.0: a major update to the DrugBank database for 2018
journal, November 2017

  • Wishart, David S.; Feunang, Yannick D.; Guo, An C.
  • Nucleic Acids Research, Vol. 46, Issue D1
  • DOI: 10.1093/nar/gkx1037

graphkernels: R and Python packages for graph comparison
journal, September 2017


Protein function prediction via graph kernels
journal, June 2005


Roofline: an insightful visual performance model for multicore architectures
journal, April 2009

  • Williams, Samuel; Waterman, Andrew; Patterson, David
  • Communications of the ACM, Vol. 52, Issue 4
  • DOI: 10.1145/1498765.1498785

Prediction of atomization energy using graph kernel and active learning
journal, January 2019

  • Tang, Yu-Hang; de Jong, Wibe A.
  • The Journal of Chemical Physics, Vol. 150, Issue 4
  • DOI: 10.1063/1.5078640

A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously
journal, January 2016

  • Selvitopi, Oguz; Acer, Seher; Aykanat, Cevdet
  • IEEE Transactions on Parallel and Distributed Systems
  • DOI: 10.1109/TPDS.2016.2577024

Automated scientific software scripting with SWIG
journal, July 2003


Cython: The Best of Both Worlds
journal, March 2011

  • Behnel, Stefan; Bradshaw, Robert; Citro, Craig
  • Computing in Science & Engineering, Vol. 13, Issue 2
  • DOI: 10.1109/MCSE.2010.118

Parallel algorithms for tensor product-based inexact graph matching
conference, June 2012

  • Livi, Lorenzo; Rizzi, Antonello
  • The 2012 International Joint Conference on Neural Networks (IJCNN)
  • DOI: 10.1109/IJCNN.2012.6252681

Global alignment of multiple protein interaction networks with application to functional orthology detection
journal, August 2008

  • Singh, R.; Xu, J.; Berger, B.
  • Proceedings of the National Academy of Sciences, Vol. 105, Issue 35
  • DOI: 10.1073/pnas.0806627105

Design of the GraphBLAS API for C
conference, May 2017

  • Buluc, Aydin; Mattson, Tim; McMillan, Scott
  • 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • DOI: 10.1109/IPDPSW.2017.117