Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

Sourouri, Mohammed; Baden, Scott B.; Cai, Xing

doi:10.1007/s10766-016-0454-1

Journal Article · October 5, 2016 · International Journal of Parallel Programming

DOI: https://doi.org/10.1007/s10766-016-0454-1 · OSTI ID: 1525220

Sourouri, Mohammed ^[1]; Baden, Scott B. ^[2]; Cai, Xing ^[1]

^[1] Simula Research Lab., Oslo (Norway); Univ. of Oslo (Norway)
^[2] Univ. of California, San Diego, CA (United States)

We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of various data motion by overlapping it with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
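This record does not reproduce the framework's directives or its generated code, so the following is only an illustrative sketch in plain C of the kind of sequential 3D stencil kernel the abstract refers to, together with one simple way a grid could be divided between CPU and GPU by z-plane. The names `idx`, `stencil_sweep`, and `split_plane`, the 7-point stencil shape, the coefficients `c0` and `c1`, and the plane-wise split are all assumptions made for illustration; they are not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index into an nx*ny*nz grid stored with x varying fastest. */
size_t idx(size_t i, size_t j, size_t k, size_t nx, size_t ny) {
    return (k * ny + j) * nx + i;
}

/* One Jacobi-style sweep of a 7-point stencil over the interior points of
 * z-planes [k_begin, k_end). Boundary points are left untouched.
 * Restricting the sweep to a plane range lets two workers (for example a
 * CPU part and a GPU part) each update their own slab. */
void stencil_sweep(const double *in, double *out,
                   size_t nx, size_t ny, size_t nz,
                   size_t k_begin, size_t k_end,
                   double c0, double c1) {
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    assert(k_begin >= 1 && k_end <= nz - 1);
    for (size_t k = k_begin; k < k_end; ++k) {
        for (size_t j = 1; j + 1 < ny; ++j) {
            for (size_t i = 1; i + 1 < nx; ++i) {
                size_t c = idx(i, j, k, nx, ny);
                out[c] = c0 * in[c]
                       + c1 * (in[c - 1] + in[c + 1]
                             + in[c - nx] + in[c + nx]
                             + in[c - nx * ny] + in[c + nx * ny]);
            }
        }
    }
}

/* Hypothetical helper: given the fraction of interior planes assigned to
 * the CPU, return the first interior plane that belongs to the GPU.
 * The CPU then owns planes [1, split) and the GPU owns [split, nz - 1). */
size_t split_plane(size_t nz, double cpu_fraction) {
    assert(nz >= 3 && cpu_fraction >= 0.0 && cpu_fraction <= 1.0);
    size_t interior = nz - 2;
    return 1 + (size_t)(cpu_fraction * (double)interior);
}
```

In a hybrid setting of the sort the abstract describes, each slab would be updated by a different device while the planes adjacent to the split are exchanged between them; that exchange, and any overlap of it with computation, is deliberately left out of this sketch.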

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Office of Science (SC)

Grant/Contract Number: AC02-05CH11231

OSTI ID: 1525220

Journal Information: International Journal of Parallel Programming, Vol. 45, Issue 3; ISSN 0885-7458

Publisher: Springer

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 14 works (citation information provided by Web of Science)
