Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

Sourouri, Mohammed; Baden, Scott B.; Cai, Xing

doi:10.1007/s10766-016-0454-1


Abstract

We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of various data movements by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
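The idea of concurrent CPU+GPU execution described in the abstract can be sketched roughly as follows. This record does not show Panda's directive syntax or its generated MPI+CUDA+OpenMP code, so the sketch below is only a plain-Python stand-in: the names `make_grid`, `sweep`, `split_sweep`, and the `gpu_fraction` parameter are hypothetical, and the static slab split along z is an assumed partitioning scheme, not necessarily the one the compiler uses.

```python
# Illustrative sketch only: NOT Panda's directives or its generated code.
# A 7-point 3D stencil sweep, plus a slab split along z that mimics how a
# hybrid code might divide one subdomain between a GPU part and a CPU part.

def make_grid(nz, ny, nx, fill=0.0):
    """Allocate an nz x ny x nx grid as nested lists."""
    return [[[fill for _ in range(nx)] for _ in range(ny)] for _ in range(nz)]


def sweep(u, out, z_begin, z_end):
    """Write the 7-point average of u into out for interior points
    whose z index lies in [z_begin, z_end)."""
    ny, nx = len(u[0]), len(u[0][0])
    for k in range(z_begin, z_end):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                out[k][j][i] = (u[k][j][i]
                                + u[k - 1][j][i] + u[k + 1][j][i]
                                + u[k][j - 1][i] + u[k][j + 1][i]
                                + u[k][j][i - 1] + u[k][j][i + 1]) / 7.0


def split_sweep(u, out, gpu_fraction):
    """Hypothetical static split of the interior z planes.

    The first slab stands in for the GPU share and the rest for the CPU
    share. Both slabs read the old grid u and write to out, so they are
    independent and could run concurrently; the planes next to the cut are
    the halo data a real hybrid code would have to exchange.
    """
    nz = len(u)
    interior = nz - 2
    cut = 1 + int(round(interior * gpu_fraction))
    sweep(u, out, 1, cut)        # would be a CUDA kernel in a hybrid code
    sweep(u, out, cut, nz - 1)   # would be an OpenMP loop in a hybrid code
    return cut
```

Calling `split_sweep(u, out, 0.5)` on a grid should give the same `out` as a single `sweep(u, out, 1, nz - 1)`, since the two slabs only partition the work. In a real hybrid code the fraction would presumably be tuned so that both processors finish at about the same time.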

Authors:

Sourouri, Mohammed ^[1]; Baden, Scott B. ^[2]; Cai, Xing ^[1]

[1] Simula Research Lab., Oslo (Norway); Univ. of Oslo (Norway)
[2] Univ. of California, San Diego, CA (United States)

Publication Date: 2016-10-05

Research Org.: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Org.: USDOE Office of Science (SC)

OSTI Identifier: 1525220

Grant/Contract Number: AC02-05CH11231

Resource Type: Accepted Manuscript

Journal Name: International Journal of Parallel Programming

Additional Journal Information: Journal Volume: 45; Journal Issue: 3; Journal ID: ISSN 0885-7458

Publisher: Springer

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Sourouri, Mohammed, Baden, Scott B., and Cai, Xing. Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers. United States: N. p., 2016. Web. doi:10.1007/s10766-016-0454-1.


Sourouri, Mohammed, Baden, Scott B., & Cai, Xing. Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers. United States. https://doi.org/10.1007/s10766-016-0454-1


Sourouri, Mohammed, Baden, Scott B., and Cai, Xing. 2016. "Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers". United States. https://doi.org/10.1007/s10766-016-0454-1. https://www.osti.gov/servlets/purl/1525220.


                    
@article{osti_1525220,
  title        = {Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers},
  author       = {Sourouri, Mohammed and Baden, Scott B. and Cai, Xing},
  abstractNote = {We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of various data movements by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90\% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.},
  doi          = {10.1007/s10766-016-0454-1},
  journal      = {International Journal of Parallel Programming},
  number       = 3,
  volume       = 45,
  place        = {United States},
  year         = {2016},
  month        = oct
}


Citation Metrics:

Cited by: 14 works (citation information provided by Web of Science)

Figures / Tables:

Fig. 1: An architectural view of the Panda source-to-source compiler, which adopts a modular design. Each module may consist of numerous sub-modules, but for brevity, only the most important sub-modules are depicted.

All figures and tables (6 total)


Works referenced in this record:

An auto-tuning framework for parallel multicore stencil computations
conference, April 2010

Kamil, Shoaib; Chan, Cy; Oliker, Leonid
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
DOI: 10.1109/IPDPS.2010.5470421

High-performance code generation for stencil computations on GPU architectures
conference, January 2012

Holewinski, Justin; Pouchet, Louis-Noël; Sadayappan, P.
Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
DOI: 10.1145/2304576.2304619

Mint: realizing CUDA performance in 3D stencil methods with annotated C
conference, January 2011

Unat, Didem; Cai, Xing; Baden, Scott B.
Proceedings of the international conference on Supercomputing - ICS '11
DOI: 10.1145/1995896.1995932

A Survey of CPU-GPU Heterogeneous Computing Techniques
journal, July 2015

Mittal, Sparsh; Vetter, Jeffrey S.
ACM Computing Surveys, Vol. 47, Issue 4
DOI: 10.1145/2788396

CPU+GPU Programming of Stencil Computations for Resource-Efficient Use of GPU Clusters
conference, October 2015

Sourouri, Mohammed; Langguth, Johannes; Spiga, Filippo
2015 IEEE 18th International Conference on Computational Science and Engineering (CSE)
DOI: 10.1109/CSE.2015.33

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
conference, January 2011

Maruyama, Naoya; Nomura, Tatsuo; Sato, Kento
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
DOI: 10.1145/2063384.2063398

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
conference, January 2013

Ragan-Kelley, Jonathan; Barnes, Connelly; Adams, Andrew
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation - PLDI '13
DOI: 10.1145/2491956.2462176

PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures
conference, May 2011

Christen, Matthias; Schenk, Olaf; Burkhart, Helmar
2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2011.70

A Study on Balancing Parallelism, Data Locality, and Recomputation in Existing PDE Solvers
conference, November 2014

Olschanowsky, Catherine; Strout, Michelle Mills; Guzik, Stephen
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2014.70

Towards automatic translation of OpenMP to MPI
conference, January 2005

Basumallik, Ayon; Eigenmann, Rudolf
Proceedings of the 19th annual international conference on Supercomputing - ICS '05
DOI: 10.1145/1088149.1088174

Understanding stencil code performance on multicore architectures
conference, January 2011

Rahman, Shah M. Faizur; Yi, Qing; Qasem, Apan
Proceedings of the 8th ACM International Conference on Computing Frontiers - CF '11
DOI: 10.1145/2016604.2016641

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters
conference, January 2012

Zhang, Yongpeng; Mueller, Frank
Proceedings of the Tenth International Symposium on Code Generation and Optimization - CGO '12
DOI: 10.1145/2259016.2259037

Early evaluation of directive-based GPU programming models for productive exascale computing
conference, November 2012

Lee, Seyong; Vetter, Jeffrey S.
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.51

Abstract Machine Models and Proxy Architectures for Exascale Computing
conference, November 2014

Ang, J. A.; Barrett, R. F.; Benner, R. E.
2014 Hardware-Software Co-Design for High Performance Computing (Co-HPC)
DOI: 10.1109/Co-HPC.2014.4

Distributed memory code generation for mixed Irregular/Regular computations
conference, January 2015

Ravishankar, Mahesh; Dathathri, Roshan; Elango, Venmugil
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP 2015
DOI: 10.1145/2688500.2688515

PARTANS: An autotuning framework for stencil computation on multi-GPU systems
journal, January 2013

Lutz, Thibaut; Fensch, Christian; Cole, Murray
ACM Transactions on Architecture and Code Optimization, Vol. 9, Issue 4
DOI: 10.1145/2400682.2400718

Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems
conference, January 2009

Venkatasubramanian, Sundaresan; Vuduc, Richard W.
Proceedings of the 23rd international conference on Supercomputing - ICS '09
DOI: 10.1145/1542275.1542312

OpenMPC: Extended OpenMP Programming and Tuning for GPUs
conference, November 2010

Lee, Seyong; Eigenmann, Rudolf
2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2010.36

High Performance Stencil Code Algorithms for GPGPUs
journal, January 2011

Schäfer, Andreas; Fey, Dietmar
Procedia Computer Science, Vol. 4
DOI: 10.1016/j.procs.2011.04.221

STELLA: a domain-specific tool for structured grid methods in weather and climate models
conference, January 2015

Gysi, Tobias; Osuna, Carlos; Fuhrer, Oliver
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - SC '15
DOI: 10.1145/2807591.2807627

Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer
conference, January 2011

Shimokawabe, Takashi; Aoki, Takayuki; Takaki, Tomohiro
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
DOI: 10.1145/2063384.2063388

SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters
conference, January 2012

Kim, Jungwon; Seo, Sangmin; Lee, Jun
Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
DOI: 10.1145/2304576.2304623

Hybrid Hexagonal/Classical Tiling for GPUs
conference, January 2014

Grosser, Tobias; Cohen, Albert; Holewinski, Justin
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization - CGO '14
DOI: 10.1145/2581122.2544160

Scalable Heterogeneous CPU-GPU Computations for Unstructured Tetrahedral Meshes
journal, July 2015

Langguth, Johannes; Sourouri, Mohammed; Lines, Glenn Terje
IEEE Micro, Vol. 35, Issue 4
DOI: 10.1109/MM.2015.70

Optimization of geometric multigrid for emerging multi- and manycore processors
conference, November 2012

Williams, Samuel; Kalamkar, Dhiraj D.; Singh, Amik
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.85

On the GPU Performance of 3D Stencil Computations Implemented in OpenCL
book, January 2013

Su, Huayou; Wu, Nan; Wen, Mei
Lecture Notes in Computer Science
DOI: 10.1007/978-3-642-38750-0_10

Roofline: an insightful visual performance model for multicore architectures
journal, April 2009

Williams, Samuel; Waterman, Andrew; Patterson, David
Communications of the ACM, Vol. 52, Issue 4
DOI: 10.1145/1498765.1498785

Hybridizing S3D into an Exascale application using OpenACC: An approach for moving to multi-petaflops and beyond
conference, November 2012

Levesque, John M.; Sankaran, Ramanan; Grout, Ray
2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
DOI: 10.1109/SC.2012.69

High-Productivity Framework on GPU-Rich Supercomputers for Operational Weather Prediction Code ASUCA
conference, November 2014

Shimokawabe, Takashi; Aoki, Takayuki; Onodera, Naoyuki
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2014.26

Works referencing / citing this record:

Domain-Specific Multi-Level IR Rewriting for GPU
preprint, January 2020

Gysi, Tobias; Müller, Christoph; Zinenko, Oleksandr
arXiv
DOI: 10.48550/arxiv.2005.13014


Similar Records in DOE PAGES and OSTI.GOV collections:

Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

Journal Article: Xu, Chuanfu; Deng, Xiaogang; Zhang, Lilun; ... - Journal of Computational Physics

Programming and optimizing complex, real-world CFD codes on current many-core accelerated HPC systems is very challenging, especially when collaborating CPUs and accelerators to fully tap the potential of heterogeneous systems. In this paper, with a tri-level hybrid and heterogeneous programming model using MPI + OpenMP + CUDA, we port and optimize our high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer. HOSTA adopts two self-developed high-order compact definite difference schemes WCNS and HDCS that can simulate flows with complex geometries. We present a dual-level parallelization scheme for efficient multi-block computation on GPUs and perform particular kernel optimizations for ...
https://doi.org/10.1016/J.JCP.2014.08.024

Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Journal Article: Basu, Protonu; Williams, Samuel; Van Straalen, Brian; ... - Parallel Computing

GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in ...
Cited by 10
https://doi.org/10.1016/j.parco.2017.04.002

Multi-GPU Implementation of a 3D Finite Difference Time Domain Earthquake Code on Heterogeneous Supercomputers

Journal Article: Zhou, Jun; Cui, Yifeng; Poyraz, Efecan; ... - Procedia Computer Science

We have developed a highly scalable 3D Finite Difference GPU code for use in earthquake engineering and disaster management through regional petascale earthquake simulations. This MPI-CUDA code is based on a widely-used wave propagation code called AWP-ODC and restructured for high throughput and efficiency on a heterogeneous computing architecture. We present an effective communication reduction technique for leveraging GPUs with minimal PCI-e overhead, and a novel overlapping method to fully hide data communication latency between GPUs. The optimization concept used in this work can be extended to general stencil computing on a structured grid. The benchmarks demonstrated sustained 100 TFlops ...
Cited by 14
https://doi.org/10.1016/j.procs.2013.05.292

A Framework for Lattice QCD Calculations on GPUs

Conference: Winter, Frank; Clark, M A; Edwards, Robert G; ...

Computing platforms equipped with accelerators like GPUs have proven to provide great computational power. However, exploiting such platforms for existing scientific applications is not a trivial task. Current GPU programming frameworks such as CUDA C/C++ require low-level programming from the developer in order to achieve high-performance code. As a result, porting of applications to GPUs is typically limited to time-dominant algorithms and routines, leaving the remainder not accelerated, which can open a serious Amdahl's law issue. The lattice QCD application Chroma allows exploring a different porting strategy. The layered structure of the software architecture logically separates the data-parallel ...
https://doi.org/10.1109/IPDPS.2014.112

Lattice Quantum Chromodynamics with Overlap Fermions on GPUs

Journal Article: Alexandru, Andrei - Computing in Science and Engineering

Lattice quantum chromodynamics (QCD) calculations were one of the first applications to demonstrate the potential of GPUs in the area of high-performance computing; the nature of lattice QCD calculations matches well with the GPU's computational model. This article discusses ways to effectively use GPUs for lattice calculations using the overlap operator, a discretization that preserves chiral symmetry even at nonzero lattice spacing and makes possible lattice QCD simulations in the parameter region relevant to nuclear physics. The author shows that the large memory footprint of these codes requires the use of multiple GPUs in parallel and discusses methods for implementing ...
https://doi.org/10.1109/mcse.2014.114
