Accelerating matrix-centric graph processing on GPUs through bit-level optimizations
- North Carolina State Univ., Raleigh, NC (United States)
- Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Binary values are known to be common in graph applications (e.g., adjacency matrices), yet how to exploit this property for efficiency has not been adequately explored. This paper presents a systematic study of bit-level optimizations for graph computations that involve binary values. It proposes a two-level representation named Bit-Block Compressed Sparse Row (B2SR) and presents a series of optimizations to graph operations on B2SR that use the intrinsics of modern GPUs. It also introduces Deep Reinforcement Learning (DRL) as an efficient way to configure the bit-level optimizations on the fly; the DQN-based adaptive tile size selector, with dedicated model training, reaches 68% prediction accuracy. Evaluations on NVIDIA Pascal and Volta GPUs show that the optimizations yield speedups of up to 40× and 6555× for the essential GraphBLAS kernels SpMV and SpGEMM, respectively, accelerating GraphBLAS-based BFS by up to 433×, SSSP, PR, and CC by up to 35×, and TC by up to 52×.
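The general idea behind a bit-block representation can be illustrated with a minimal CPU-side sketch. It is not the paper's implementation: the function names (`pack_bit_tiles`, `pack_vector`, `spmv_binary`), the fixed tile width, and the exact array layout are assumptions made for illustration. The upper level is a CSR-like index over non-empty tiles, the lower level stores each tile row as a bitmask, and a binary SpMV reduces to a bitwise AND followed by a population count (on a GPU, an intrinsic such as CUDA's `__popc` would play that role).

```python
TILE = 8  # hypothetical fixed tile width; the paper selects tile sizes adaptively


def pack_bit_tiles(dense, tile=TILE):
    """Pack a 0/1 matrix into CSR-like arrays over non-empty tile x tile bit blocks."""
    n_rows, n_cols = len(dense), len(dense[0])
    row_ptr, col_idx, blocks = [0], [], []
    for br in range(0, n_rows, tile):
        for bc in range(0, n_cols, tile):
            # One integer bitmask per tile row; bit j marks column bc + j.
            rows = []
            for r in range(br, min(br + tile, n_rows)):
                mask = 0
                for c in range(bc, min(bc + tile, n_cols)):
                    if dense[r][c]:
                        mask |= 1 << (c - bc)
                rows.append(mask)
            if any(rows):  # keep only tiles that contain at least one nonzero
                col_idx.append(bc // tile)
                blocks.append(rows)
        row_ptr.append(len(blocks))
    return row_ptr, col_idx, blocks


def pack_vector(x, tile=TILE):
    """Pack a 0/1 vector into one bitmask per tile-wide segment."""
    words = []
    for start in range(0, len(x), tile):
        mask = 0
        for j, v in enumerate(x[start:start + tile]):
            if v:
                mask |= 1 << j
        words.append(mask)
    return words


def spmv_binary(row_ptr, col_idx, blocks, x_words, n_rows, tile=TILE):
    """Compute y = A @ x for binary A and x using AND + popcount per tile row."""
    y = [0] * n_rows
    for tr in range(len(row_ptr) - 1):
        for k in range(row_ptr[tr], row_ptr[tr + 1]):
            xw = x_words[col_idx[k]]
            for r, mask in enumerate(blocks[k]):
                # Count the columns where both the matrix row and x have a 1.
                y[tr * tile + r] += bin(mask & xw).count("1")
    return y
```

With a binary input vector, each output entry counts the neighbors of a vertex that are set in the vector, which is the kind of operation a GraphBLAS-style BFS step relies on.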
- Research Organization:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization:
- National Science Foundation (NSF); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- AC05-76RL01830; SC0021293; EE0009357
- OSTI ID:
- 1968852
- Alternate ID(s):
- OSTI ID: 1962681
- Report Number(s):
- PNNL-SA-179122
- Journal Information:
- Journal of Parallel and Distributed Computing, Vol. 177; ISSN 0743-7315
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
Similar Records
A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs
GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM