A High-Throughput Solver for Marginalized Graph Kernels on GPU
Abstract
Here, we present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.
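The idea of solving a linear system on the tensor product of two graphs without ever storing that product can be illustrated with a small CPU sketch. The code below is not the solver described in the abstract: it assumes unlabeled graphs (so the product reduces to a plain Kronecker product), uses a hypothetical uniform stopping weight `q` to build the diagonal and the right-hand side, and runs in pure Python rather than on a GPU with tiles. The names `kron_matvec`, `conjugate_gradient`, and `product_graph_solve` are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def kron_matvec(A, B, x):
    """Return (A kron B) x without forming A kron B.

    x is a flat list indexed as i * m + j, where i indexes the nodes of
    graph A (n nodes) and j indexes the nodes of graph B (m nodes).
    """
    n, m = len(A), len(B)
    y = [0.0] * (n * m)
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            if a == 0.0:
                continue  # skip empty entries of A, a crude form of sparsity
            for j in range(m):
                acc = 0.0
                for l in range(m):
                    acc += B[j][l] * x[k * m + l]
                y[i * m + j] += a * acc
    return y


def conjugate_gradient(matvec, b, rel_tol=1e-10, max_iter=1000):
    """Solve M x = b for a symmetric positive definite M given only x -> M x."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - M x for the initial guess x = 0
    p = list(r)          # search direction
    rs = dot(r, r)
    threshold = rel_tol * dot(b, b) ** 0.5
    for _ in range(max_iter):
        if rs ** 0.5 <= threshold:
            break
        Mp = matvec(p)
        alpha = rs / dot(p, Mp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def product_graph_solve(A, B, q=0.05):
    """Solve (D - A kron B) r = D * q^2 on the product graph (assumed form).

    D is diagonal with entries (deg_A[i] + q) * (deg_B[j] + q); adding q > 0
    makes the matrix strictly diagonally dominant, hence positive definite.
    Returns the uniformly weighted sum of r and the solution vector r.
    """
    n, m = len(A), len(B)
    dA = [sum(row) + q for row in A]
    dB = [sum(row) + q for row in B]
    diag = [dA[i] * dB[j] for i in range(n) for j in range(m)]

    def matvec(x):
        Ax = kron_matvec(A, B, x)  # product graph term, computed on the fly
        return [d * xi - axi for d, xi, axi in zip(diag, x, Ax)]

    b = [d * q * q for d in diag]
    r = conjugate_gradient(matvec, b)
    return sum(r) / (n * m), r
```

Calling `product_graph_solve(A, B)` with two small symmetric adjacency matrices (lists of lists of floats) returns a scalar similarity and the solution vector. The only memory that scales with the product graph is a handful of vectors of length `n * m`; the `n * m` by `n * m` matrix itself is never materialized, which is the property the abstract relies on to trade memory traffic for arithmetic.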
- Authors:
- Tang, Yu-Hang; Selvitopi, Oguz; Popovici, Doru Thom; Buluc, Aydin
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Publication Date:
- 2020-05-01
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1582329
- Grant/Contract Number:
- AC02-05CH11231; AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Additional Journal Information:
- Journal Name: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS); Journal Volume: 2020; Related Information: see also on arXiv abs/1910.06310
- Publisher:
- IEEE
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; kernel; symmetric matrices; linear systems; mathematical model; tensile stress; task analysis; graphics processing units
Citation Formats
Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, and Buluc, Aydin. A High-Throughput Solver for Marginalized Graph Kernels on GPU. United States: N. p., 2020. Web. doi:10.1109/ipdps47924.2020.00080.
Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, & Buluc, Aydin. A High-Throughput Solver for Marginalized Graph Kernels on GPU. United States. https://doi.org/10.1109/ipdps47924.2020.00080
Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, and Buluc, Aydin. 2020. "A High-Throughput Solver for Marginalized Graph Kernels on GPU". United States. https://doi.org/10.1109/ipdps47924.2020.00080. https://www.osti.gov/servlets/purl/1582329.
@article{osti_1582329,
title = {A High-Throughput Solver for Marginalized Graph Kernels on GPU},
author = {Tang, Yu-Hang and Selvitopi, Oguz and Popovici, Doru Thom and Buluc, Aydin},
abstractNote = {Here, we present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.},
doi = {10.1109/ipdps47924.2020.00080},
journal = {2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
volume = 2020,
place = {United States},
year = {2020},
month = {5}
}
Works referenced in this record:
A linear-time heuristic for improving network partitions
conference, January 1988
- Fiduccia, C. M.; Mattheyses, R. M.
- Papers on Twenty-five years of electronic design automation - 25 years of DAC
Improving performance of sparse matrix-vector multiplication
conference, January 1999
- Pinar, Ali; Heath, Michael T.
- Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Accelerating dissipative particle dynamics simulations on GPUs: Algorithms, numerics and applications
journal, November 2014
- Tang, Yu-Hang; Karniadakis, George Em
- Computer Physics Communications, Vol. 185, Issue 11
Think Locally, Act Globally: Highly Balanced Graph Partitioning
book, January 2013
- Sanders, Peter; Schulz, Christian
- Experimental Algorithms
An effective multilevel tabu search approach for balanced graph partitioning
journal, July 2011
- Benlic, Una; Hao, Jin-Kao
- Computers & Operations Research, Vol. 38, Issue 7
DrugBank 5.0: a major update to the DrugBank database for 2018
journal, November 2017
- Wishart, David S.; Feunang, Yannick D.; Guo, An C.
- Nucleic Acids Research, Vol. 46, Issue D1
graphkernels: R and Python packages for graph comparison
journal, September 2017
- Sugiyama, Mahito; Ghisu, M. Elisabetta; Llinares-López, Felipe
- Bioinformatics, Vol. 34, Issue 3
Protein function prediction via graph kernels
journal, June 2005
- Borgwardt, K. M.; Ong, C. S.; Schonauer, S.
- Bioinformatics, Vol. 21, Issue Suppl 1
Roofline: an insightful visual performance model for multicore architectures
journal, April 2009
- Williams, Samuel; Waterman, Andrew; Patterson, David
- Communications of the ACM, Vol. 52, Issue 4
Prediction of atomization energy using graph kernel and active learning
journal, January 2019
- Tang, Yu-Hang; de Jong, Wibe A.
- The Journal of Chemical Physics, Vol. 150, Issue 4
A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously
journal, January 2016
- Selvitopi, Oguz; Acer, Seher; Aykanat, Cevdet
- IEEE Transactions on Parallel and Distributed Systems
Automated scientific software scripting with SWIG
journal, July 2003
- Beazley, D. M.
- Future Generation Computer Systems, Vol. 19, Issue 5
Cython: The Best of Both Worlds
journal, March 2011
- Behnel, Stefan; Bradshaw, Robert; Citro, Craig
- Computing in Science & Engineering, Vol. 13, Issue 2
Parallel algorithms for tensor product-based inexact graph matching
conference, June 2012
- Livi, Lorenzo; Rizzi, Antonello
- The 2012 International Joint Conference on Neural Networks (IJCNN)
Global alignment of multiple protein interaction networks with application to functional orthology detection
journal, August 2008
- Singh, R.; Xu, J.; Berger, B.
- Proceedings of the National Academy of Sciences, Vol. 105, Issue 35
Design of the GraphBLAS API for C
conference, May 2017
- Buluc, Aydin; Mattson, Tim; McMillan, Scott
- 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)