DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

Abstract

We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of various forms of data motion by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
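
For readers unfamiliar with the term, the following is a minimal sketch of the kind of sequential 3D stencil C code the abstract refers to as input. It is an illustration only: the function names `idx` and `stencil7_sweep`, the 7-point shape, and the coefficients `c0` and `c1` are assumptions made for this example and are not taken from the paper. This record does not show Panda's directive syntax, so no directives appear here.

```c
#include <stddef.h>

/* Linear index of grid point (i, j, k) in an nx * ny * nz array, x fastest.
   Hypothetical helper, not part of the Panda framework. */
static size_t idx(size_t i, size_t j, size_t k, size_t nx, size_t ny)
{
    return (k * ny + j) * nx + i;
}

/* One sequential sweep of a 7-point stencil over the interior points.
   Each interior value of `out` becomes c0 times the centre value of `in`
   plus c1 times the sum of its six face neighbours.
   Boundary points of `out` are left untouched. */
void stencil7_sweep(const double *in, double *out,
                    size_t nx, size_t ny, size_t nz,
                    double c0, double c1)
{
    for (size_t k = 1; k + 1 < nz; ++k) {
        for (size_t j = 1; j + 1 < ny; ++j) {
            for (size_t i = 1; i + 1 < nx; ++i) {
                double neighbours =
                    in[idx(i - 1, j, k, nx, ny)] + in[idx(i + 1, j, k, nx, ny)] +
                    in[idx(i, j - 1, k, nx, ny)] + in[idx(i, j + 1, k, nx, ny)] +
                    in[idx(i, j, k - 1, nx, ny)] + in[idx(i, j, k + 1, nx, ny)];
                out[idx(i, j, k, nx, ny)] =
                    c0 * in[idx(i, j, k, nx, ny)] + c1 * neighbours;
            }
        }
    }
}
```

A triple loop nest of this shape, reading a fixed neighbourhood from one array and writing to another, is the pattern that a hybrid MPI+CUDA+OpenMP code would split across nodes, GPUs, and CPU cores.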

Authors:
Sourouri, Mohammed [1]; Baden, Scott B. [2]; Cai, Xing [1]
  1. Simula Research Lab., Oslo (Norway); Univ. of Oslo (Norway)
  2. Univ. of California, San Diego, CA (United States)
Publication Date:
October 2016
Research Org.:
Oak Ridge National Laboratory, Oak Ridge Leadership Computing Facility (OLCF); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1525220
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of Parallel Programming
Additional Journal Information:
Journal Volume: 45; Journal Issue: 3; Journal ID: ISSN 0885-7458
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Sourouri, Mohammed, Baden, Scott B., and Cai, Xing. Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers. United States: N. p., 2016. Web. doi:10.1007/s10766-016-0454-1.
Sourouri, Mohammed, Baden, Scott B., & Cai, Xing. Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers. United States. doi:10.1007/s10766-016-0454-1.
Sourouri, Mohammed, Baden, Scott B., and Cai, Xing. 2016. "Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers". United States. doi:10.1007/s10766-016-0454-1. https://www.osti.gov/servlets/purl/1525220.
@article{osti_1525220,
title = {Panda: A Compiler Framework for Concurrent CPU $+$ GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers},
author = {Sourouri, Mohammed and Baden, Scott B. and Cai, Xing},
abstractNote = {We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI$+$ CUDA$+$ OpenMP code that uses concurrent CPU$+$ GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of various data motion by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90 % of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.},
doi = {10.1007/s10766-016-0454-1},
journal = {International Journal of Parallel Programming},
number = 3,
volume = 45,
place = {United States},
year = {2016},
month = {10}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 3 works
Citation information provided by
Web of Science

Works referenced in this record:

A Survey of CPU-GPU Heterogeneous Computing Techniques
journal, July 2015

  • Mittal, Sparsh; Vetter, Jeffrey S.
  • ACM Computing Surveys, Vol. 47, Issue 4
  • DOI: 10.1145/2788396

PARTANS: An autotuning framework for stencil computation on multi-GPU systems
journal, January 2013

  • Lutz, Thibaut; Fensch, Christian; Cole, Murray
  • ACM Transactions on Architecture and Code Optimization, Vol. 9, Issue 4
  • DOI: 10.1145/2400682.2400718

High Performance Stencil Code Algorithms for GPGPUs
journal, January 2011

Scalable Heterogeneous CPU-GPU Computations for Unstructured Tetrahedral Meshes
journal, July 2015

  • Langguth, Johannes; Sourouri, Mohammed; Lines, Glenn Terje
  • IEEE Micro, Vol. 35, Issue 4
  • DOI: 10.1109/MM.2015.70

Roofline: an insightful visual performance model for multicore architectures
journal, April 2009

  • Williams, Samuel; Waterman, Andrew; Patterson, David
  • Communications of the ACM, Vol. 52, Issue 4
  • DOI: 10.1145/1498765.1498785