Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers
Abstract
GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near-Roofline performance (Roofline being a performance bound for a computation and target architecture) across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures, as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.
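The abstract describes Roofline only as a performance bound for a computation and target architecture. As a rough illustration, the basic form of the model takes the lesser of the machine's peak compute rate and the product of its peak memory bandwidth and the kernel's arithmetic intensity. The sketch below assumes that basic form; the function name and all machine and kernel numbers are hypothetical and are not taken from the paper.

```python
def roofline_bound(peak_gflops, peak_bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the basic Roofline model.

    arithmetic_intensity is in flops per byte of memory traffic.
    """
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine parameters, chosen only for illustration.
PEAK_GFLOPS = 1000.0        # peak floating-point rate, GFLOP/s
PEAK_BANDWIDTH_GBS = 200.0  # peak memory bandwidth, GB/s

# A low-intensity kernel (typical of stencils) is limited by bandwidth.
low_ai_bound = roofline_bound(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, 0.5)

# A high-intensity kernel is limited by the compute peak instead.
high_ai_bound = roofline_bound(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS, 10.0)
```

Under these assumed numbers the low-intensity kernel is bandwidth-bound, so "near-Roofline" for such a kernel means approaching the bandwidth-limited rate rather than the machine's peak compute rate.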
- Authors:
- Basu, Protonu; Williams, Samuel; Van Straalen, Brian; Oliker, Leonid; Colella, Phillip; Hall, Mary
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Univ. of Utah, Salt Lake City, UT (United States). School of Computing
- Publication Date:
- 2017-04-05
- Research Org.:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1379823
- Alternate Identifier(s):
- OSTI ID: 1397648
- Grant/Contract Number:
- AC02-05CH11231; AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Parallel Computing
- Additional Journal Information:
- Journal Volume: 64; Journal Issue: C; Journal ID: ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; GPU; Compiler; Autotuning; Multigrid
Citation Formats
Basu, Protonu, Williams, Samuel, Van Straalen, Brian, Oliker, Leonid, Colella, Phillip, and Hall, Mary. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers. United States: N. p., 2017.
Web. doi:10.1016/j.parco.2017.04.002.
Basu, Protonu, Williams, Samuel, Van Straalen, Brian, Oliker, Leonid, Colella, Phillip, & Hall, Mary. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers. United States. https://doi.org/10.1016/j.parco.2017.04.002
Basu, Protonu, Williams, Samuel, Van Straalen, Brian, Oliker, Leonid, Colella, Phillip, and Hall, Mary. 2017.
"Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers". United States. https://doi.org/10.1016/j.parco.2017.04.002. https://www.osti.gov/servlets/purl/1379823.
@article{osti_1379823,
title = {Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers},
author = {Basu, Protonu and Williams, Samuel and Van Straalen, Brian and Oliker, Leonid and Colella, Phillip and Hall, Mary},
abstractNote = {GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near-Roofline performance (Roofline being a performance bound for a computation and target architecture) across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures, as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.},
doi = {10.1016/j.parco.2017.04.002},
journal = {Parallel Computing},
number = {C},
volume = 64,
place = {United States},
year = {2017},
month = {4}
}
Works referenced in this record:
Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors
journal, February 2009
- Datta, Kaushik; Kamil, Shoaib; Williams, Samuel
- SIAM Review, Vol. 51, Issue 1
Improving the arithmetic intensity of multigrid with the help of polynomial smoothers
journal, February 2012
- Ghysels, P.; Kłosiewicz, P.; Vanroose, W.
- Numerical Linear Algebra with Applications, Vol. 19, Issue 2
A script-based autotuning compiler system to generate high-performance CUDA code
journal, January 2013
- Khan, Malik; Basu, Protonu; Rudy, Gabe
- ACM Transactions on Architecture and Code Optimization, Vol. 9, Issue 4
Roofline: an insightful visual performance model for multicore architectures
journal, April 2009
- Williams, Samuel; Waterman, Andrew; Patterson, David
- Communications of the ACM, Vol. 52, Issue 4
Introducing a parallel cache oblivious blocking approach for the lattice Boltzmann method
journal, January 2008
- Zeiser, T.; Wellein, G.; Nitsure, A.
- Progress in Computational Fluid Dynamics, An International Journal, Vol. 8, Issue 1/2/3/4
Works referencing / citing this record:
A Survey on Compiler Autotuning using Machine Learning
journal, January 2019
- Ashouri, Amir H.; Killian, William; Cavazos, John
- ACM Computing Surveys, Vol. 51, Issue 5
Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight
journal, May 2019
- Ma, Wenjing; Ao, Yulong; Yang, Chao
- Cluster Computing, Vol. 23, Issue 2
A Survey on Compiler Autotuning using Machine Learning
text, January 2018
- Ashouri, Amir H.; Killian, William; Cavazos, John
- arXiv
Accelerating Multigrid-based Hierarchical Scientific Data Refactoring on GPUs
preprint, January 2020
- Chen, Jieyang; Wan, Lipeng; Liang, Xin
- arXiv