U.S. Department of Energy
Office of Scientific and Technical Information

Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs

Conference
Wavefront loops are widely used in scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, the data dependencies in wavefront loops make it challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions optimize for these factors only to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance, while compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for the change, suffer from both global synchronization overhead and limited generality. In this paper, we first prove the circumstances under which breaking data dependencies and properly changing the order of computation operators in our compensation-based method does not affect the correctness of results. Based on this analysis, we design a highly efficient compensation-based parallelization method for GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and combines them with tiling to optimize memory access and synchronization. Performance results on NVIDIA K80 and P100 GPU platforms demonstrate that our method achieves significant improvements over state-of-the-art research for four types of real-world application kernels.
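The compensation idea in the abstract can be illustrated with a small CPU-side sketch. This is not the paper's GPU kernel: the recurrence, the boundary convention (out-of-range neighbours treated as 0), and the names `fill_sequential`, `fill_compensated`, and `weighted_max_scan` are all assumptions chosen for illustration. The sketch assumes a max-plus recurrence of the kind found in sequence alignment, where each cell depends on its upper, upper-left, and left neighbours. Phase 1 ignores the left neighbour so every cell in a row can be computed independently; phase 2 compensates with a weighted max-scan, which takes a logarithmic number of steps when the inner loop runs in parallel.

```python
def fill_sequential(score, gap):
    """Reference: cell (i, j) depends on its upper, upper-left and left cells."""
    cols = len(score[0])
    above = [0] * cols                      # assumed boundary row of zeros
    table = []
    for row in score:
        cur = []
        for j in range(cols):
            diag = above[j - 1] if j > 0 else 0
            best = max(diag + row[j], above[j] + gap)
            if j > 0:                       # the dependency that serializes the row
                best = max(best, cur[j - 1] + gap)
            cur.append(best)
        table.append(cur)
        above = cur
    return table


def weighted_max_scan(t, gap):
    """Inclusive scan computing h[j] = max over k <= j of t[k] + (j - k) * gap.

    Doubling (Hillis-Steele style) scheme: within one step every j is
    independent, so the inner loop could map to one GPU thread per element.
    """
    h = list(t)
    step = 1
    while step < len(h):
        prev = h[:]                         # snapshot stands in for a barrier
        for j in range(step, len(h)):
            h[j] = max(prev[j], prev[j - step] + step * gap)
        step *= 2
    return h


def fill_compensated(score, gap):
    """Same table as fill_sequential, computed in two phases per row."""
    cols = len(score[0])
    above = [0] * cols
    table = []
    for row in score:
        # Phase 1: break the left-neighbour dependency; all j are independent.
        partial = [
            max((above[j - 1] if j > 0 else 0) + row[j], above[j] + gap)
            for j in range(cols)
        ]
        # Phase 2: compensate for the ignored dependency with a weighted scan.
        cur = weighted_max_scan(partial, gap)
        table.append(cur)
        above = cur
    return table


if __name__ == "__main__":
    demo = [[3, -1, 2, 0], [-1, 4, -2, 1], [2, -3, 5, -1]]
    print(fill_compensated(demo, -2) == fill_sequential(demo, -2))
```

The two versions agree in this sketch because addition distributes over max, so unrolling `cur[j] = max(partial[j], cur[j-1] + gap)` yields exactly the weighted scan. Recurrences whose operators lack such a property would need a separate correctness argument, which is the kind of condition the abstract says the paper establishes.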
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1474547
Country of Publication:
United States
Language:
English
