OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers

Abstract

GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near-Roofline performance (Roofline being a performance bound for a given computation and target architecture) across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures, as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.

Authors:
Basu, Protonu [1]; Williams, Samuel [1]; Van Straalen, Brian [1]; Oliker, Leonid [1]; Colella, Phillip [1]; Hall, Mary [2]
  1. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  2. Univ. of Utah, Salt Lake City, UT (United States). School of Computing
Publication Date:
2017-04-05
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1379823
Alternate Identifier(s):
OSTI ID: 1397648
Grant/Contract Number:
AC02-05CH11231; AC05-00OR22725
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
Parallel Computing
Additional Journal Information:
Journal Volume: 64; Journal Issue: C; Journal ID: ISSN 0167-8191
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; GPU; Compiler; Autotuning; Multigrid

Citation Formats

Basu, Protonu, Williams, Samuel, Van Straalen, Brian, Oliker, Leonid, Colella, Phillip, and Hall, Mary. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers. United States: N. p., 2017. Web. doi:10.1016/j.parco.2017.04.002.
Basu, Protonu, Williams, Samuel, Van Straalen, Brian, Oliker, Leonid, Colella, Phillip, & Hall, Mary. Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers. United States. doi:10.1016/j.parco.2017.04.002.
Basu, Protonu, Williams, Samuel, Van Straalen, Brian, Oliker, Leonid, Colella, Phillip, and Hall, Mary. 2017. "Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers". United States. doi:10.1016/j.parco.2017.04.002. https://www.osti.gov/servlets/purl/1379823.
@article{osti_1379823,
title = {Compiler-based code generation and autotuning for geometric multigrid on GPU-accelerated supercomputers},
author = {Basu, Protonu and Williams, Samuel and Van Straalen, Brian and Oliker, Leonid and Colella, Phillip and Hall, Mary},
abstractNote = {GPUs, with their high bandwidths and computational capabilities, are an increasingly popular target for scientific computing. Unfortunately, to date, harnessing the power of the GPU has required use of a GPU-specific programming model like CUDA, OpenCL, or OpenACC. Thus, in order to deliver portability across CPU-based and GPU-accelerated supercomputers, programmers are forced to write and maintain two versions of their applications or frameworks. In this paper, we explore the use of a compiler-based autotuning framework based on CUDA-CHiLL to deliver not only portability, but also performance portability across CPU- and GPU-accelerated platforms for the geometric multigrid linear solvers found in many scientific applications. We also show that with autotuning we can attain near-Roofline performance (Roofline being a performance bound for a given computation and target architecture) across the key operations in the miniGMG benchmark for both CPU- and GPU-based architectures, as well as for multiple stencil discretizations and smoothers. We show that our technology is readily interoperable with MPI, resulting in performance at scale equal to that obtained via a hand-optimized MPI+CUDA implementation.},
doi = {10.1016/j.parco.2017.04.002},
journal = {Parallel Computing},
number = C,
volume = 64,
place = {United States},
year = {2017},
month = {4}
}

Similar Records:
  • This paper describes a compiler approach to introducing communication-avoiding optimizations in geometric multigrid (GMG), one of the most popular methods for solving partial differential equations. Communication-avoiding optimizations reduce vertical communication through the memory hierarchy and horizontal communication across processes or threads, usually at the expense of introducing redundant computation. We focus on applying these optimizations to the smooth operator, which successively reduces the error and accounts for the largest fraction of the GMG execution time. Our compiler technology applies both novel and known transformations to derive an implementation comparable to manually-tuned code. To make the approach portable, an underlying autotuning system explores the tradeoff between reduced communication and increased computation, as well as tradeoffs in threading schemes, to automatically identify the best implementation for a particular architecture and at each computation phase. Results show that we are able to quadruple the performance of the smooth operation on the finest grids while attaining performance within 94% of manually-tuned code. Overall, we improve the multigrid solve time by 2.5× without sacrificing programmer productivity.
  • Modeling large-scale sky survey observations is a key driver for the continuing development of high-resolution, large-volume, cosmological simulations. We report the first results from the "Q Continuum" cosmological N-body simulation run carried out on the GPU-accelerated supercomputer Titan. The simulation encompasses a volume of (1300 Mpc)³ and evolves more than half a trillion particles, leading to a particle mass resolution of m_p ≃ 1.5 × 10⁸ M☉. At this mass resolution, the Q Continuum run is currently the largest cosmology simulation available. It enables the construction of detailed synthetic sky catalogs, encompassing different modeling methodologies, including semi-analytic modeling and sub-halo abundance matching in a large, cosmological volume. Here we describe the simulation and outputs in detail and present first results for a range of cosmological statistics, such as mass power spectra, halo mass functions, and halo mass-concentration relations for different epochs. We also provide details on challenges connected to running a simulation on almost 90% of Titan, one of the fastest supercomputers in the world, including our usage of Titan's GPU accelerators.
  • In this paper, a new scalable hydrodynamic code, GPUPEGAS (GPU-accelerated Performance Gas Astrophysical Simulation), for the simulation of interacting galaxies is proposed. The details of a parallel numerical method co-design are described. A speed-up of 55 times was obtained within a single GPU accelerator. The use of 60 GPU accelerators resulted in 96% parallel efficiency. A collisionless hydrodynamic approach has been used for modeling of stars and dark matter. The scalability of the GPUPEGAS code is shown.
  • Purpose: Conventional spot scanning intensity modulated proton therapy (IMPT) treatment planning systems (TPSs) optimize proton spot weights based on analytical dose calculations. These analytical dose calculations have been shown to have severe limitations in heterogeneous materials. Monte Carlo (MC) methods do not have these limitations; however, MC-based systems have been of limited clinical use due to the large number of beam spots in IMPT and the extremely long calculation time of traditional MC techniques. In this work, the authors present a clinically applicable IMPT TPS that utilizes a very fast MC calculation. Methods: An in-house graphics processing unit (GPU)-based MC dose calculation engine was employed to generate the dose influence map for each proton spot. With the MC generated influence map, a modified least-squares optimization method was used to achieve the desired dose volume histograms (DVHs). The intrinsic CT image resolution was adopted for voxelization in simulation and optimization to preserve spatial resolution. The optimizations were computed on a multi-GPU framework to mitigate the memory limitation issues for the large dose influence maps that resulted from maintaining the intrinsic CT resolution. The effects of tail cutoff and starting condition were studied and minimized in this work. Results: For relatively large and complex three-field head and neck cases, i.e., >100 000 spots with a target volume of ∼1000 cm³ and multiple surrounding critical structures, the optimization together with the initial MC dose influence map calculation was done in a clinically viable time frame (less than 30 min) on a GPU cluster consisting of 24 Nvidia GeForce GTX Titan cards. The in-house MC TPS plans were comparable to commercial TPS plans based on DVH comparisons. Conclusions: An MC-based treatment planning system was developed. The treatment planning can be performed in a clinically viable time frame on a hardware system costing around 45 000 dollars. The fast calculation and optimization make the system easily expandable to robust and multicriteria optimization.