DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

Journal Article · International Journal of Parallel Programming

We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of the various data movements by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
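As a rough illustration of the idea in the abstract, the sketch below shows a sequential 7-point 3D stencil sweep in C and a split of the interior z-planes into a "GPU" share and a "CPU" share. It is not Panda's actual input syntax or generated code: the directive text in the comment, the function names `stencil_sweep`, `split_plane`, and `hetero_step`, and the parameters `gpu_fraction` and `alpha` are all hypothetical. In real hybrid MPI+CUDA+OpenMP code the two shares would presumably run concurrently (a CUDA kernel and OpenMP threads) while halo data moves in the background; here both shares are plain sequential calls so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index into an nx*ny*nz grid, x varying fastest. */
static size_t idx(int i, int j, int k, int nx, int ny)
{
    return ((size_t)k * (size_t)ny + (size_t)j) * (size_t)nx + (size_t)i;
}

/* One Jacobi-style sweep of a 7-point stencil over the z-planes
 * [k_begin, k_end). Boundary points of the grid are left untouched. */
void stencil_sweep(const double *in, double *out, int nx, int ny, int nz,
                   int k_begin, int k_end, double alpha)
{
    assert(k_begin >= 1 && k_end <= nz - 1);
    /* HYPOTHETICAL annotation, shown only as a comment; the real Panda
     * directive syntax is not given in this record:
     *   #pragma <stencil-directive> in(in) out(out)
     */
    for (int k = k_begin; k < k_end; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                double c = in[idx(i, j, k, nx, ny)];
                double sum = in[idx(i - 1, j, k, nx, ny)]
                           + in[idx(i + 1, j, k, nx, ny)]
                           + in[idx(i, j - 1, k, nx, ny)]
                           + in[idx(i, j + 1, k, nx, ny)]
                           + in[idx(i, j, k - 1, nx, ny)]
                           + in[idx(i, j, k + 1, nx, ny)];
                out[idx(i, j, k, nx, ny)] = c + alpha * (sum - 6.0 * c);
            }
}

/* Return the first z-plane owned by the CPU when a fraction gpu_fraction
 * of the interior planes [1, nz-1) is assigned to the GPU (assumed policy). */
int split_plane(int nz, double gpu_fraction)
{
    assert(nz >= 3 && gpu_fraction >= 0.0 && gpu_fraction <= 1.0);
    int gpu_planes = (int)((nz - 2) * gpu_fraction + 0.5);
    return 1 + gpu_planes;
}

/* One time step with the domain divided between the two device shares.
 * Both calls read only from `in`, so they are independent of each other
 * and could execute concurrently in a real heterogeneous implementation. */
void hetero_step(const double *in, double *out, int nx, int ny, int nz,
                 double gpu_fraction, double alpha)
{
    int split = split_plane(nz, gpu_fraction);
    stencil_sweep(in, out, nx, ny, nz, 1, split, alpha);      /* "GPU" share */
    stencil_sweep(in, out, nx, ny, nz, split, nz - 1, alpha); /* "CPU" share */
}
```

Because each output point depends only on the input array, splitting the planes does not change the result; the split sweep produces the same values as a single sweep over all interior planes. Choosing `gpu_fraction` to match the relative speed of the two devices is one plausible way to balance the load, though the record does not say how Panda decides the split.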

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
Grant/Contract Number:
Journal Information:
International Journal of Parallel Programming, Vol. 45, Issue 3; ISSN 0885-7458
Springer
Country of Publication:
United States
Citation Metrics:
Cited by: 14 works (citation information provided by Web of Science)

Cited By (1)

Domain-Specific Multi-Level IR Rewriting for GPU preprint January 2020

Figures / Tables (6)