A High-Throughput Solver for Marginalized Graph Kernels on GPU

Tang, Yu-Hang; Selvitopi, Oguz; Popovici, Doru Thom; Buluc, Aydin

doi:10.1109/ipdps47924.2020.00080

Abstract

Here, we present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.
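As a rough illustration of the "on-the-fly" idea in the abstract, the sketch below runs conjugate gradient on a Kronecker-structured system without ever materializing the tensor product. It assumes a simplified system (D - A ⊗ B) x = b with a diagonal D chosen to make the operator symmetric positive definite. The function names, the toy graphs, and the choice of D are hypothetical and are not taken from the paper. It is plain CPU Python and ignores edge labels, tiles, warps, and shared memory entirely.

```python
def kron_matvec(A, B, x):
    """Return (A kron B) @ x without forming the product matrix.

    A is n x n, B is m x m, and x has length n*m, where the pair
    (i, k) of a node in A and a node in B maps to index i*m + k.
    """
    n, m = len(A), len(B)
    y = [0.0] * (n * m)
    for i in range(n):
        for j in range(n):
            a = A[i][j]
            if a == 0.0:
                continue  # skip absent edges of the first graph
            for k in range(m):
                acc = 0.0
                for l in range(m):
                    acc += B[k][l] * x[j * m + l]
                y[i * m + k] += a * acc
    return y


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def solve_cg(A, B, diag, b, tol=1e-10, max_iter=1000):
    """Solve (diag - A kron B) x = b with unpreconditioned CG."""

    def apply(v):
        Kv = kron_matvec(A, B, v)
        return [d * vi - kv for d, vi, kv in zip(diag, v, Kv)]

    x = [0.0] * len(b)
    r = list(b)          # residual for the zero initial guess
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = apply(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


# Toy inputs (hypothetical): a 3-node path graph and a single edge.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
deg_A = [sum(row) for row in A]
deg_B = [sum(row) for row in B]
# Diagonal exceeds every row sum of A kron B, so the operator is
# symmetric and strictly diagonally dominant, hence positive definite.
diag = [da * db + 1.0 for da in deg_A for db in deg_B]
b = [1.0] * len(diag)
x = solve_cg(A, B, diag, b)
```

The only memory held at any time is the two small adjacency matrices and a few vectors of length n*m; the (n*m) x (n*m) product matrix is never stored, which is the trade of extra arithmetic for less memory traffic that the abstract describes.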

Authors:

Tang, Yu-Hang [1]; Selvitopi, Oguz [1]; Popovici, Doru Thom [1]; Buluc, Aydin [1]

[1] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Publication Date: 2020-05-01

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1582329

Grant/Contract Number: AC02-05CH11231; AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Additional Journal Information: Journal Volume: 2020; Related Information: see also on arXiv abs/1910.06310

Publisher: IEEE

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; kernel; symmetric matrices; linear systems; mathematical model; tensile stress; task analysis; graphics processing units

Citation Formats


Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, and Buluc, Aydin. A High-Throughput Solver for Marginalized Graph Kernels on GPU. United States: N. p., 2020. Web. doi:10.1109/ipdps47924.2020.00080.


Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, & Buluc, Aydin. A High-Throughput Solver for Marginalized Graph Kernels on GPU. United States. https://doi.org/10.1109/ipdps47924.2020.00080


Tang, Yu-Hang, Selvitopi, Oguz, Popovici, Doru Thom, and Buluc, Aydin. 2020. "A High-Throughput Solver for Marginalized Graph Kernels on GPU". United States. https://doi.org/10.1109/ipdps47924.2020.00080. https://www.osti.gov/servlets/purl/1582329.


                    
@article{osti_1582329,
  title        = {A High-Throughput Solver for Marginalized Graph Kernels on GPU},
  author       = {Tang, Yu-Hang and Selvitopi, Oguz and Popovici, Doru Thom and Buluc, Aydin},
  abstractNote = {Here, we present the design and optimization of a solver for efficient and high-throughput computation of the marginalized graph kernel on General Purpose GPUs. The graph kernel is computed using the conjugate gradient method to solve a generalized Laplacian of the tensor product between a pair of graphs. To cope with the large gap between the instruction throughput and the memory bandwidth of the GPUs, our solver forms the graph tensor product on-the-fly without storing it in memory. This is achieved by using threads in a warp cooperatively to stream the adjacency and edge label matrices of individual graphs by small square matrix blocks called tiles, which are then staged in registers and the shared memory for later reuse. Warps across a thread block can further share tiles via the shared memory to increase data reuse. We exploit the sparsity of the graphs hierarchically by storing only non-empty tiles using a coordinate format and nonzero elements within each tile using bitmaps. We propose a new partition-based reordering algorithm for aggregating nonzero elements of the graphs into fewer but denser tiles to further exploit sparsity. We carry out extensive theoretical analyses on the graph tensor product primitives for tiles of various density and evaluate their performance on synthetic and real-world datasets. Our solver delivers three to four orders of magnitude speedup over existing CPU-based solvers such as GraKeL and GraphKernels. The capability of the solver enables kernel-based learning tasks at unprecedented scales.},
  doi          = {10.1109/ipdps47924.2020.00080},
  journal      = {2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
  volume       = {2020},
  place        = {United States},
  year         = {2020},
  month        = {5}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1109/ipdps47924.2020.00080


Works referenced in this record:

The Protein Data Bank
journal, January 2000

Berman, H. M.
Nucleic Acids Research, Vol. 28, Issue 1
DOI: 10.1093/nar/28.1.235

A linear-time heuristic for improving network partitions
conference, January 1988

Fiduccia, C. M.; Mattheyses, R. M.
Papers on Twenty-five years of electronic design automation - 25 years of DAC
DOI: 10.1145/62882.62910

Improving performance of sparse matrix-vector multiplication
conference, January 1999

Pinar, Ali; Heath, Michael T.
Proceedings of the 1999 ACM/IEEE conference on Supercomputing
DOI: 10.1145/331532.331562

Accelerating dissipative particle dynamics simulations on GPUs: Algorithms, numerics and applications
journal, November 2014

Tang, Yu-Hang; Karniadakis, George Em
Computer Physics Communications, Vol. 185, Issue 11
DOI: 10.1016/j.cpc.2014.06.015

Think Locally, Act Globally: Highly Balanced Graph Partitioning
book, January 2013

Sanders, Peter; Schulz, Christian
Experimental Algorithms
DOI: 10.1007/978-3-642-38527-8_16

An effective multilevel tabu search approach for balanced graph partitioning
journal, July 2011

Benlic, Una; Hao, Jin-Kao
Computers & Operations Research, Vol. 38, Issue 7
DOI: 10.1016/j.cor.2010.10.007

DrugBank 5.0: a major update to the DrugBank database for 2018
journal, November 2017

Wishart, David S.; Feunang, Yannick D.; Guo, An C.
Nucleic Acids Research, Vol. 46, Issue D1
DOI: 10.1093/nar/gkx1037

graphkernels: R and Python packages for graph comparison
journal, September 2017

Sugiyama, Mahito; Ghisu, M. Elisabetta; Llinares-López, Felipe
Bioinformatics, Vol. 34, Issue 3
DOI: 10.1093/bioinformatics/btx602

Protein function prediction via graph kernels
journal, June 2005

Borgwardt, K. M.; Ong, C. S.; Schonauer, S.
Bioinformatics, Vol. 21, Issue Suppl 1
DOI: 10.1093/bioinformatics/bti1007

Roofline: an insightful visual performance model for multicore architectures
journal, April 2009

Williams, Samuel; Waterman, Andrew; Patterson, David
Communications of the ACM, Vol. 52, Issue 4
DOI: 10.1145/1498765.1498785

Prediction of atomization energy using graph kernel and active learning
journal, January 2019

Tang, Yu-Hang; de Jong, Wibe A.
The Journal of Chemical Physics, Vol. 150, Issue 4
DOI: 10.1063/1.5078640

A Recursive Hypergraph Bipartitioning Framework for Reducing Bandwidth and Latency Costs Simultaneously
journal, January 2016

Selvitopi, Oguz; Acer, Seher; Aykanat, Cevdet
IEEE Transactions on Parallel and Distributed Systems
DOI: 10.1109/TPDS.2016.2577024

Automated scientific software scripting with SWIG
journal, July 2003

Beazley, D. M.
Future Generation Computer Systems, Vol. 19, Issue 5
DOI: 10.1016/S0167-739X(02)00171-1

Cython: The Best of Both Worlds
journal, March 2011

Behnel, Stefan; Bradshaw, Robert; Citro, Craig
Computing in Science & Engineering, Vol. 13, Issue 2
DOI: 10.1109/MCSE.2010.118

Parallel algorithms for tensor product-based inexact graph matching
conference, June 2012

Livi, Lorenzo; Rizzi, Antonello
The 2012 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/IJCNN.2012.6252681

Global alignment of multiple protein interaction networks with application to functional orthology detection
journal, August 2008

Singh, R.; Xu, J.; Berger, B.
Proceedings of the National Academy of Sciences, Vol. 105, Issue 35
DOI: 10.1073/pnas.0806627105

Design of the GraphBLAS API for C
conference, May 2017

Buluc, Aydin; Mattson, Tim; McMillan, Scott
2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/IPDPSW.2017.117

Similar Records in DOE PAGES and OSTI.GOV collections:

Data Locality Enhancement of Dynamic Simulations for Exascale Computing (Final Report)

Technical Report. Shen, Xipeng

The development of modern processors exhibits two trends that complicate the optimization of modern software. The first is the increasing sensitivity of processors' throughput to irregularities in computation. With more processors produced through a massive integration of simple cores, future systems will increasingly favor regular data-level parallel computations, but deviate from the needs of applications with complex patterns. Some evidence is already visible on Graphics Processing Units (GPUs): irregular data accesses (e.g., indirect references A[D[i]]) and conditional branches limit many GPU applications' performance to a level an order of magnitude below the GPU's peak. The second hardware ...
https://doi.org/10.2172/1576175

Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs

Conference. Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.; ...

In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphics Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different levels of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread level ...
https://doi.org/10.1007/978-3-642-40047-6_81

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

Conference. Kim, Jinsung; Sukumaran-Rajan, Aravind; Hong, Changwan; ...

Tensor contractions are higher dimensional analogs of matrix multiplications, used in many computational contexts such as high order models in quantum chemistry, deep learning, finite element methods, etc. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. Some of the challenges in optimizing tensor contractions that arise in practice from the variety of dimensionalities and shapes for ...
https://doi.org/10.1145/3205289.3205296

Tensor Contraction and Operation Minimization for Extreme Scale Computational Chemistry

Technical Report. Sabin, Gerald; Sadayappan, P.

During Phase 1, RNET and U. Utah accomplished four main tasks to demonstrate the technical and practical feasibility of the ES-SciLA tensor library. Evaluated the SpMM runtime performance using the KokkosKernels framework, i.e., the multi-vector 2D Tensor KokkosSparse::spmv interface. KokkosKernels represents a high productivity library that is an important part of the DOE exascale platform. It provides shared memory performance portability across a range of hardware architectures and is a target for planned future integration in ES-SciLA. The results include matrices from DARPA's SDNN Graph Challenge, a Transformer model [8], structural and power law matrices from the Suite ...

GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

Conference. Yang, Carl; Buluc, Aydin; Owens, John D.

High-performance implementations of graph algorithms are challenging to implement on new parallel hardware such as GPUs, because of three challenges: (1) difficulty of coming up with graph building blocks, (2) load imbalance on parallel hardware, and (3) graph problems having low arithmetic intensity. To address these challenges, GraphBLAS is an innovative, on-going effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable and portable manner. In this paper, we examine the performance challenges of a linear algebra-based approach to building graph frameworks ...
