Towards Enhancing Coding Productivity for GPU Programming Using Static Graphs

Toledo, Leonel; Valero-Lara, Pedro; Vetter, Jeffrey S.; Peña, Antonio J.

doi:10.3390/electronics11091307

Journal Article · Wed Apr 20 00:00:00 EDT 2022 · Electronics

DOI: https://doi.org/10.3390/electronics11091307 · OSTI ID: 1883753

Barcelona Supercomputing Center (BSC) (Spain)
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, scalability problems remain, as do limits on the amount of work that a GPU can perform at a time. To minimize the overhead associated with launching GPU kernels, and to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including the CUDA math libraries) and the OpenACC programming model. As test cases we use two well-known problems that are widely used in HPC and AI: the Conjugate Gradient method and Particle Swarm Optimization. In the first test case (Conjugate Gradient), we focus on the integration of Static Graphs with CUDA. Here we significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11x thanks to a better implementation that benefits from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with CUDA Graph, again achieving accelerations of up to one order of magnitude, with average speedups ranging from 2x to 4x and performance very close to that of a reference, optimized CUDA code. Our main target is a more productive coding model for GPU programming based on Static Graphs, which transparently provides better exploitation of the GPU capacity. Combining Static Graphs with two of the most important current GPU programming models (CUDA and OpenACC) considerably reduces execution time with respect to using CUDA or OpenACC alone, with accelerations that exceed one order of magnitude in the best cases. Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC specification.
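
The define-once, replay-many idea behind Static Graphs can be illustrated with a small CPU-only sketch. This is an analogy under stated assumptions, not the CUDA Graph API and not the interface the authors propose for OpenACC: the class `StaticGraph`, its methods `add_task`, `instantiate`, and `launch`, and the helper `run_demo` are hypothetical names introduced here for illustration. The point it shows is that dependencies between tasks are resolved once, and the resulting schedule is then reused on every iteration instead of being rebuilt per launch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class StaticGraph:
    """Toy task graph: declare tasks and dependencies once, replay many times.

    Hypothetical illustration only; it does not call CUDA or OpenACC.
    """

    def __init__(self):
        self._tasks = {}    # task name -> callable taking a shared state dict
        self._deps = {}     # task name -> set of prerequisite task names
        self._order = None  # execution order, fixed by instantiate()

    def add_task(self, name, fn, deps=()):
        self._tasks[name] = fn
        self._deps[name] = set(deps)
        self._order = None  # the graph changed, so it must be re-instantiated

    def instantiate(self):
        # Resolve the dependency order a single time, up front.
        unknown = {d for deps in self._deps.values() for d in deps} - set(self._tasks)
        if unknown:
            raise ValueError(f"unknown dependencies: {sorted(unknown)}")
        self._order = tuple(TopologicalSorter(self._deps).static_order())

    def launch(self, state):
        # Replay the precomputed order; no scheduling work happens here.
        if self._order is None:
            raise RuntimeError("call instantiate() before launch()")
        for name in self._order:
            self._tasks[name](state)
        return state


def run_demo(iterations):
    """Build a three-task chain (scale -> shift -> update) and replay it."""
    graph = StaticGraph()
    # Tasks are registered out of order on purpose; instantiate() sorts them.
    graph.add_task("update", lambda s: s.update(x=s["z"]), deps=["shift"])
    graph.add_task("shift", lambda s: s.update(z=s["y"] + 1.0), deps=["scale"])
    graph.add_task("scale", lambda s: s.update(y=2.0 * s["x"]))
    graph.instantiate()

    state = {"x": 1.0}
    for _ in range(iterations):
        graph.launch(state)
    return state["x"]
```

With these assumptions, `run_demo(3)` returns `15.0`, since each replay maps x to 2x + 1 (1, 3, 7, 15). On a GPU the saving described in the abstract comes from launch overhead rather than from Python-level scheduling, so the sketch conveys only the structure of the approach, not its performance.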

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC); European Union’s Horizon 2020

Grant/Contract Number: AC05-00OR22725; 801051

OSTI ID: 1883753

Journal Information: Electronics, Vol. 11, Issue 9; ISSN 2079-9292

Publisher: MDPI

Country of Publication: United States

Language: English

Similar Records

Static Graphs for Coding Productivity in OpenACC

Conference · Wed Dec 01 00:00:00 EST 2021 · OSTI ID:1883753

Toledo, Leonel; Valero Lara, Pedro; Vetter, Jeffrey; +1 more

KokkACC: Enhancing Kokkos with OpenACC

Conference · Tue Nov 01 00:00:00 EDT 2022 · OSTI ID:1883753

Valero Lara, Pedro; Lee, Seyong; Gonzalez Tallada, Marc; +2 more

OpenACC unified programming environment for GPU and FPGA multi-hybrid acceleration

Conference · Wed Jul 01 00:00:00 EDT 2020 · OSTI ID:1883753

Tsunashima, Ryuta; Kobayashi, Ryohei; Fujita, Norihisa; +6 more

Related Subjects

58 GEOSCIENCES
coding productivity
tasking
data dependencies
static graph
CUDA
OpenACC
conjugate gradient
particle swarm optimization
