Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs

Hou, Kaixi; Wang, Hao; Feng, Wu-Chun; Vetter, Jeffrey; Lee, Seyong

doi:10.1109/IPDPS.2018.00037

Conference · Tue May 01 04:00:00 EDT 2018

DOI: https://doi.org/10.1109/IPDPS.2018.00037 · OSTI ID: 1474547

Hou, Kaixi ^[1]; Wang, Hao ^[1]; Feng, Wu-Chun ^[1]; Vetter, Jeffrey ^[2]; Lee, Seyong ^[2]

^[1] Virginia Tech, Blacksburg, VA
^[2] ORNL

Wavefront loops are widely used in scientific applications such as partial differential equation (PDE) solvers and sequence alignment tools. However, the data dependencies in wavefront loops make it challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions optimize for these factors only to a limited extent. For example, tiling-based methods optimize memory access but may cause load imbalance, while compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for the change, suffer from both global synchronization overhead and limited generality. In this paper, we first prove the circumstances under which breaking data dependencies and properly changing the sequence of computation operators in our compensation-based method does not affect the correctness of the results. Based on this analysis, we design a highly efficient compensation-based parallelization method for GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and combines them with tiling to optimize memory access and synchronization. Performance results on NVIDIA K80 and P100 GPU platforms demonstrate that our method achieves significant improvements over state-of-the-art research for four types of real-world application kernels.
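The compensation idea in the abstract can be illustrated with a minimal CPU sketch. It is not the paper's GPU kernels. It assumes a simple max-plus recurrence, F[i][j] = max(F[i-1][j] + scores[i][j], F[i][j-1] + GAP); the names `GAP`, `wavefront_serial`, and `wavefront_compensated` are hypothetical. Each row is first computed with the horizontal dependency ignored, and a weighted prefix-max scan then compensates for it.

```python
from itertools import accumulate

GAP = -1  # hypothetical constant weight on the horizontal (west) dependency


def wavefront_serial(top, left, scores):
    """Reference loop: every cell waits for its north and west neighbours."""
    prev, out = list(top), []
    for i, row in enumerate(scores):
        cur = []
        for j, s in enumerate(row):
            west = left[i] if j == 0 else cur[j - 1]
            cur.append(max(prev[j] + s, west + GAP))
        out.append(cur)
        prev = cur
    return out


def wavefront_compensated(top, left, scores):
    """Same result, but each row is split into two data-parallel steps."""
    prev, out = list(top), []
    for i, row in enumerate(scores):
        # Step 1: ignore the west dependency; every j is independent.
        t = [prev[j] + s for j, s in enumerate(row)]
        # Step 2: compensate with a weighted prefix-max scan.
        # Unrolling the recurrence gives
        #   F[j] = max(max_{k<=j} (t[k] + (j - k) * GAP), left + (j + 1) * GAP),
        # so scanning u[k] = t[k] - k * GAP with max and adding j * GAP
        # back restores the dropped dependency.
        u = [t[k] - k * GAP for k in range(len(t))]
        m = list(accumulate(u, max))
        cur = [max(m[j] + j * GAP, left[i] + (j + 1) * GAP)
               for j in range(len(t))]
        out.append(cur)
        prev = cur
    return out


scores = [[3, -2, 5, 0], [1, 4, -3, 2], [-1, 0, 6, -4]]
top, left = [0, 0, 0, 0], [0, 0, 0]
assert wavefront_serial(top, left, scores) == wavefront_compensated(top, left, scores)
```

The rewrite is valid here because addition distributes over max, which is the kind of operator property the abstract says must be proven before the computation order is changed. On a GPU, the per-row map and the scan would each run in parallel; this sketch uses sequential Python only to check that the two orderings agree.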

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1474547

Country of Publication: United States

Language: English
