OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism

Abstract

Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism and often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g., reduction vs. forall) yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, and the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.
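The abstract's core idea, projecting which cooperative-thread transformation will perform best before writing any GPU code, can be illustrated with a first-order analytical sketch. The model below is NOT the authors' framework; it is a simplified, hypothetical utilization estimate assuming each group of `group_size` threads cooperatively walks one outer-loop iteration and strip-mines inner work of width `inner_width` across its members.

```python
# Illustrative sketch (not the paper's model): estimate thread utilization for
# candidate cooperative-group sizes, then pick the most promising mapping.
import math

def utilization(num_threads, outer_iters, inner_width, group_size):
    """First-order utilization estimate for one candidate transformation.

    Groups of `group_size` hardware threads each process one outer iteration;
    concurrent groups sweep the outer loop in waves, and inner parallelism of
    width `inner_width` is strip-mined over the group's members.
    """
    groups = num_threads // group_size          # outer iterations in flight
    if groups == 0:
        return 0.0
    waves = math.ceil(outer_iters / groups)     # rounds over the outer loop
    inner_steps = math.ceil(inner_width / group_size)  # steps per iteration
    useful = outer_iters * inner_width          # useful thread-steps needed
    spent = waves * inner_steps * num_threads   # thread-steps consumed
    return useful / spent

def best_group_size(num_threads, outer_iters, inner_width,
                    candidates=(1, 2, 4, 8, 16, 32)):
    """Return the candidate group size with the highest projected utilization."""
    return max(candidates,
               key=lambda g: utilization(num_threads, outer_iters,
                                         inner_width, g))
```

For example, with 1024 hardware threads, a 16-iteration outer loop, and inner width 64, one thread per outer iteration leaves most of the machine idle, while a 32-thread cooperative group projects far better utilization; a real framework would additionally model memory behavior, divergence, and per-pattern (reduction vs. forall) costs.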

Authors:
Meng, Jiayuan; Uram, Thomas; Morozov, Vitali A.; Vishwanath, Venkatram; Kumaran, Kalyan
Publication Date:
2015
Research Org.:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
Argonne National Laboratory - Argonne Leadership Computing Facility
OSTI Identifier:
1336031
DOE Contract Number:  
AC02-06CH11357
Resource Type:
Conference
Resource Relation:
Conference: 16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing, in conjunction with the 29th IEEE International Parallel & Distributed Processing Symposium, 05/29/15 - 05/29/15, Hyderabad, India
Country of Publication:
United States
Language:
English

Citation Formats

Meng, Jiayuan, Uram, Thomas, Morozov, Vitali A., Vishwanath, Venkatram, and Kumaran, Kalyan. Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism. United States: N. p., 2015. Web. doi:10.1109/IPDPSW.2015.55.
Meng, Jiayuan, Uram, Thomas, Morozov, Vitali A., Vishwanath, Venkatram, & Kumaran, Kalyan. Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism. United States. doi:10.1109/IPDPSW.2015.55.
Meng, Jiayuan, Uram, Thomas, Morozov, Vitali A., Vishwanath, Venkatram, and Kumaran, Kalyan. 2015. "Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism". United States. doi:10.1109/IPDPSW.2015.55.
@article{osti_1336031,
title = {Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism},
author = {Meng, Jiayuan and Uram, Thomas and Morozov, Vitali A. and Vishwanath, Venkatram and Kumaran, Kalyan},
abstractNote = {Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism and often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g., reduction vs. forall) yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, and the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.},
doi = {10.1109/IPDPSW.2015.55},
journal = {},
place = {United States},
year = {2015},
month = {1}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
