DOE PAGES — U.S. Department of Energy
Office of Scientific and Technical Information

Title: GPU code optimization using abstract kernel emulation and sensitivity analysis

Abstract

In this paper, we develop a novel approach to GPU kernel optimization by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck. Performance modeling for GPUs is done by abstract kernel emulation along with latency/gap modeling of resources. Sensitivity analysis with respect to resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. Here, the utility of the bottleneck analysis is demonstrated in two contexts: 1) Enhancing the OpenTuner auto-tuner with the new bottleneck-driven optimization strategy. Experimental results on all kernels from the Rodinia suite and GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate effectiveness. 2) Manual code optimization: two case studies illustrate the use of the bottleneck analysis to iteratively improve performance of code from state-of-the-art DSL code generators.
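
The core idea of the abstract — perturb each resource's latency/gap parameters in a performance model and see which perturbation degrades predicted execution time the most — can be illustrated with a minimal sketch. This is not the paper's actual emulation model: the toy cost function, the resource names (`global_mem`, `shared_mem`, `alu`), and all parameter values below are hypothetical, chosen only to show the sensitivity-analysis mechanism.

```python
# Illustrative sketch of latency/gap sensitivity analysis for GPU
# bottleneck identification. NOT the paper's actual model: the cost
# function, resource names, and numbers are hypothetical.

def predicted_cycles(params, counts):
    """Toy cost model: each resource's time is (instruction count x
    issue gap) plus a startup latency; the slowest resource dominates."""
    return max(
        counts[r] * params[r]["gap"] + params[r]["latency"]
        for r in params
    )

def find_bottleneck(params, counts, delta=0.10):
    """Perturb each resource's gap parameter by `delta` and report the
    resource whose perturbation increases predicted time the most."""
    base = predicted_cycles(params, counts)
    sensitivity = {}
    for r in params:
        perturbed = {k: dict(v) for k, v in params.items()}
        perturbed[r]["gap"] *= 1.0 + delta
        sensitivity[r] = predicted_cycles(perturbed, counts) - base
    return max(sensitivity, key=sensitivity.get), sensitivity

# Hypothetical per-resource parameters (cycles) and per-resource
# operation counts for one kernel.
params = {
    "global_mem": {"latency": 400, "gap": 4.0},
    "shared_mem": {"latency": 30, "gap": 1.0},
    "alu":        {"latency": 10, "gap": 0.5},
}
counts = {"global_mem": 1000, "shared_mem": 500, "alu": 2000}

bottleneck, sens = find_bottleneck(params, counts)
print(bottleneck)  # the resource most worth optimizing away
```

In this toy instance the predicted time is insensitive to the shared-memory and ALU parameters but responds strongly to the global-memory gap, so the analysis flags the kernel as global-memory-bound; an optimizer (automatic or manual) would then apply transformations targeting that resource.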

Authors:
 Hong, Changwan [1]; Sukumaran-Rajam, Aravind [1]; Kim, Jinsung [1]; Rawat, Prashant Singh [1]; Krishnamoorthy, Sriram [2]; Pouchet, Louis-Noël [3]; Rastello, Fabrice [4]; Sadayappan, P. [1]
  1. The Ohio State Univ., Columbus, OH (United States)
  2. Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
  3. Colorado State Univ., Fort Collins, CO (United States)
  4. Univ. Grenoble Alpes, Grenoble (France)
Publication Date: June 2018
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Wind Energy Technologies Office (EE-4WE); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1582638
Report Number(s):
PNNL-SA-132802
Journal ID: ISSN 0362-1340
Grant/Contract Number:  
AC05-76RL01830
Resource Type:
Accepted Manuscript
Journal Name:
ACM SIGPLAN Notices
Additional Journal Information:
Journal Volume: 53; Journal Issue: 4; Journal ID: ISSN 0362-1340
Publisher:
ACM
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; Performance modeling; GPU; abstract emulation; bottleneck analysis; sensitivity analysis

Citation Formats

Hong, Changwan, Sukumaran-Rajam, Aravind, Kim, Jinsung, Rawat, Prashant Singh, Krishnamoorthy, Sriram, Pouchet, Louis-Noël, Rastello, Fabrice, and Sadayappan, P. GPU code optimization using abstract kernel emulation and sensitivity analysis. United States: N. p., 2018. Web. doi:10.1145/3296979.3192397.
Hong, Changwan, Sukumaran-Rajam, Aravind, Kim, Jinsung, Rawat, Prashant Singh, Krishnamoorthy, Sriram, Pouchet, Louis-Noël, Rastello, Fabrice, and Sadayappan, P. 2018. "GPU code optimization using abstract kernel emulation and sensitivity analysis". United States. doi:10.1145/3296979.3192397. https://www.osti.gov/servlets/purl/1582638.
@article{osti_1582638,
title = {GPU code optimization using abstract kernel emulation and sensitivity analysis},
author = {Hong, Changwan and Sukumaran-Rajam, Aravind and Kim, Jinsung and Rawat, Prashant Singh and Krishnamoorthy, Sriram and Pouchet, Louis-Noël and Rastello, Fabrice and Sadayappan, P.},
abstractNote = {In this paper, we develop a novel approach to GPU kernel optimization by focusing on identification of bottleneck resources and determining optimization parameters that can alleviate the bottleneck. Performance modeling for GPUs is done by abstract kernel emulation along with latency/gap modeling of resources. Sensitivity analysis with respect to resource latency/gap parameters is used to predict the bottleneck resource for a given kernel's execution. Here, the utility of the bottleneck analysis is demonstrated in two contexts: 1) Enhancing the OpenTuner auto-tuner with the new bottleneck-driven optimization strategy. Experimental results on all kernels from the Rodinia suite and GPU tensor contraction kernels from the NWChem computational chemistry suite demonstrate effectiveness. 2) Manual code optimization: two case studies illustrate the use of the bottleneck analysis to iteratively improve performance of code from state-of-the-art DSL code generators.},
doi = {10.1145/3296979.3192397},
journal = {ACM SIGPLAN Notices},
number = 4,
volume = 53,
place = {United States},
year = {2018},
month = {6}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Works referenced in this record:

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning
conference, January 2017

  • Zhang, Xiuxia; Tan, Guangming; Xue, Shuangbai
  • Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '17
  • DOI: 10.1145/3018743.3018755

A practical automatic polyhedral parallelizer and locality optimizer
conference, January 2008

  • Bondhugula, Uday; Hartono, Albert; Ramanujam, J.
  • Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation - PLDI '08
  • DOI: 10.1145/1375581.1375595

A large-scale cross-architecture evaluation of thread-coarsening
conference, January 2013

  • Magni, Alberto; Dubach, Christophe; O'Boyle, Michael F. P.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
  • DOI: 10.1145/2503210.2503268

Resource Conscious Reuse-Driven Tiling for GPUs
conference, January 2016

  • Rawat, Prashant Singh; Hong, Changwan; Ravishankar, Mahesh
  • Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16
  • DOI: 10.1145/2967938.2967967

High-performance code generation for stencil computations on GPU architectures
conference, January 2012

  • Holewinski, Justin; Pouchet, Louis-Noël; Sadayappan, P.
  • Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
  • DOI: 10.1145/2304576.2304619

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
conference, January 2009

  • Hong, Sunpyo; Kim, Hyesoon
  • Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09
  • DOI: 10.1145/1555754.1555775

Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters
conference, September 2010

  • Ma, Wenjing; Krishnamoorthy, Sriram; Villa, Oreste
  • 2010 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2010.26

A performance analysis framework for identifying potential benefits in GPGPU applications
conference, January 2012

  • Sim, Jaewoong; Dasgupta, Aniruddha; Kim, Hyesoon
  • Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming - PPoPP '12
  • DOI: 10.1145/2145816.2145819

PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures
conference, May 2011

  • Christen, Matthias; Schenk, Olaf; Burkhart, Helmar
  • 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2011.70

Fusing convolution kernels through tiling
conference, January 2015

  • Ravishankar, Mahesh; Micikevicius, Paulius; Grover, Vinod
  • Proceedings of the 2nd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming - ARRAY 2015
  • DOI: 10.1145/2774959.2774965

OpenARC: open accelerator research compiler for directive-based, efficient heterogeneous computing
conference, January 2014

  • Lee, Seyong; Vetter, Jeffrey S.
  • Proceedings of the 23rd international symposium on High-performance parallel and distributed computing - HPDC '14
  • DOI: 10.1145/2600212.2600704

OpenMP: an industry standard API for shared-memory programming
journal, January 1998

  • Dagum, L.; Menon, R.
  • IEEE Computational Science and Engineering, Vol. 5, Issue 1
  • DOI: 10.1109/99.660313

Hybrid Hexagonal/Classical Tiling for GPUs
conference, January 2014

  • Grosser, Tobias; Cohen, Albert; Holewinski, Justin
  • Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization - CGO '14
  • DOI: 10.1145/2581122.2544160

COMPASS: A Framework for Automated Performance Modeling and Prediction
conference, January 2015

  • Lee, Seyong; Meredith, Jeremy S.; Vetter, Jeffrey S.
  • Proceedings of the 29th ACM on International Conference on Supercomputing - ICS '15
  • DOI: 10.1145/2751205.2751220

Automatic optimization of thread-coarsening for graphics processors
conference, January 2014

  • Magni, Alberto; Dubach, Christophe; O'Boyle, Michael
  • Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
  • DOI: 10.1145/2628071.2628087

OpenTuner: an extensible framework for program autotuning
conference, January 2014

  • Ansel, Jason; Kamil, Shoaib; Veeramachaneni, Kalyan
  • Proceedings of the 23rd international conference on Parallel architectures and compilation - PACT '14
  • DOI: 10.1145/2628071.2628092

GPU code optimization using abstract kernel emulation and sensitivity analysis
conference, January 2018

  • Hong, Changwan; Sukumaran-Rajam, Aravind; Kim, Jinsung
  • Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI 2018
  • DOI: 10.1145/3192366.3192397

Roofline: an insightful visual performance model for multicore architectures
journal, April 2009

  • Williams, Samuel; Waterman, Andrew; Patterson, David
  • Communications of the ACM, Vol. 52, Issue 4
  • DOI: 10.1145/1498765.1498785

Optimizing tensor contraction expressions for hybrid CPU-GPU execution
journal, November 2011


A performance analysis framework for exploiting GPU microarchitectural capability
conference, January 2017

  • Zhou, Keren; Tan, Guangming; Zhang, Xiuxia
  • Proceedings of the International Conference on Supercomputing - ICS '17
  • DOI: 10.1145/3079079.3079083