OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs

Abstract

Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront loops, it is challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions can only optimize for these factors to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance, whereas compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for it, suffer from both global synchronization overhead and limited generality. In this paper, we first prove the circumstances under which breaking data dependencies and properly changing the sequence of computation operators in our compensation-based method does not affect the correctness of the results. Based on this analysis, we design a highly efficient compensation-based parallelization scheme for GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and is combined with tiling to optimize memory access and synchronization. The performance results on the NVIDIA K80 and P100 GPU platforms demonstrate that our method can achieve significant improvements for four types of real-world application kernels over state-of-the-art approaches.
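
The compensation idea in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' GPU implementation: the recurrence (a global-alignment-style update with a linear gap penalty), the function names `wavefront_reference` and `wavefront_compensated`, and the parameters `score` and `gap` are all assumptions chosen for illustration. The sketch first computes each row while ignoring the dependency on the left neighbor, so every cell in the row is independent, and then compensates for the ignored dependency with a weighted prefix-max scan.

```python
from itertools import accumulate


def wavefront_reference(score, gap):
    """Plain sequential wavefront loop (illustrative recurrence, not from the paper).

    H[i][j] = max(H[i-1][j-1] + score, H[i-1][j] - gap, H[i][j-1] - gap)
    """
    rows, cols = len(score), len(score[0])
    H = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(cols + 1):
        H[0][j] = -j * gap
    for i in range(1, rows + 1):
        H[i][0] = -i * gap
        for j in range(1, cols + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + score[i - 1][j - 1],  # diagonal neighbor
                H[i - 1][j] - gap,                      # upper neighbor
                H[i][j - 1] - gap,                      # left neighbor
            )
    return H


def wavefront_compensated(score, gap):
    """Same result, computed row by row with the left dependency broken."""
    rows, cols = len(score), len(score[0])
    prev = [-j * gap for j in range(cols + 1)]
    H = [prev]
    for i in range(1, rows + 1):
        # Step 1: ignore the left neighbor. Each entry depends only on the
        # previous row, so all columns could be computed in parallel.
        partial = [-i * gap] + [
            max(prev[j - 1] + score[i - 1][j - 1], prev[j] - gap)
            for j in range(1, cols + 1)
        ]
        # Step 2: compensate. Unrolling H[j] = max(partial[j], H[j-1] - gap)
        # gives H[j] = max_{k <= j}(partial[k] + k * gap) - j * gap,
        # which is a prefix-max scan over weighted values.
        weighted = [partial[k] + k * gap for k in range(cols + 1)]
        scanned = list(accumulate(weighted, max))
        prev = [scanned[j] - j * gap for j in range(cols + 1)]
        H.append(prev)
    return H


# Hypothetical 3x3 score matrix: +2 on the diagonal, -1 elsewhere.
example_score = [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]
assert wavefront_compensated(example_score, 1) == wavefront_reference(example_score, 1)
```

The scan in step 2 is sequential here, but prefix scans with an associative operator such as `max` have well-known parallel formulations, which is what makes this reordering attractive on GPUs.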

Authors:
 Hou, Kaixi [1]; Wang, Hao [1]; Feng, Wu-Chun [1]; Vetter, Jeffrey S. [2]; Lee, Seyong [2]
  1. Virginia Tech, Blacksburg, VA
  2. Oak Ridge National Laboratory (ORNL), Oak Ridge, TN
Publication Date:
May 2018
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1474547
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018), Vancouver, Canada, May 21-25, 2018
Country of Publication:
United States
Language:
English

Citation Formats

Hou, Kaixi, Wang, Hao, Feng, Wu-Chun, Vetter, Jeffrey S., and Lee, Seyong. Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs. United States: N. p., 2018. Web. doi:10.1109/IPDPS.2018.00037.
Hou, Kaixi, Wang, Hao, Feng, Wu-Chun, Vetter, Jeffrey S., & Lee, Seyong. Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs. United States. doi:10.1109/IPDPS.2018.00037.
Hou, Kaixi, Wang, Hao, Feng, Wu-Chun, Vetter, Jeffrey S., and Lee, Seyong. 2018. "Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs". United States. doi:10.1109/IPDPS.2018.00037. https://www.osti.gov/servlets/purl/1474547.
@inproceedings{osti_1474547,
title = {Highly Efficient Compensation-Based Parallelism for Wavefront Loops on GPUs},
author = {Hou, Kaixi and Wang, Hao and Feng, Wu-Chun and Vetter, Jeffrey S. and Lee, Seyong},
abstractNote = {Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront loops, it is challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions can only optimize for these factors to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance, whereas compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for it, suffer from both global synchronization overhead and limited generality. In this paper, we first prove the circumstances under which breaking data dependencies and properly changing the sequence of computation operators in our compensation-based method does not affect the correctness of the results. Based on this analysis, we design a highly efficient compensation-based parallelization scheme for GPUs. Our method provides weighted scan-based GPU kernels to optimize the computation and is combined with tiling to optimize memory access and synchronization. The performance results on the NVIDIA K80 and P100 GPU platforms demonstrate that our method can achieve significant improvements for four types of real-world application kernels over state-of-the-art approaches.},
doi = {10.1109/IPDPS.2018.00037},
booktitle = {IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018)},
place = {United States},
year = {2018},
month = {5}
}

