Performance Analysis of PIConGPU: Particle-in-Cell on GPUs using NVIDIA’s NSight Systems and NSight Compute

Leinhauser, Matthew; Young, Jeffrey; Bastrakov, Sergei; Widera, Rene; Chatterjee, Ronnie; Chandrasekaran, Sunita

doi:10.2172/1761619

Technical Report · Mon Jan 18 23:00:00 EST 2021

DOI: https://doi.org/10.2172/1761619 · OSTI ID: 1761619

Leinhauser, Matthew ^[1]; Young, Jeffrey ^[2]; Bastrakov, Sergei ^[3]; Widera, Rene ^[3]; Chatterjee, Ronnie ^[4]; Chandrasekaran, Sunita ^[1]

[1] Univ. of Delaware, Newark, DE (United States)
[2] Georgia Institute of Technology, Atlanta, GA (United States)
[3] Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Dresden (Germany)
[4] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

PIConGPU (Particle-in-Cell on GPUs) is an open-source simulation framework for plasma and laser-plasma physics, used to develop advanced particle accelerators for radiation therapy of cancer, high-energy physics, and photon science. Although PIConGPU has been optimized for at least five years to run well on NVIDIA GPU-based clusters, the development team has done only limited exploration of potential scalability bottlenecks using recently updated and new tools, including NVIDIA’s NVProf tool and the brand-new NVIDIA NSight Suite (Systems and Compute). PIConGPU is a highly optimized application that runs production jobs at scale on the Oak Ridge Leadership Computing Facility’s (OLCF) Summit supercomputer (using the full machine at 4600 nodes, with 98% GPU utilization on all ~28000 NVIDIA Volta GPUs). PIConGPU has been selected as one of the eight applications for OLCF’s coveted Center for Accelerated Application Readiness (CAAR) program aimed at the facility’s Frontier supercomputer (OLCF’s first exascale system, to launch in 2021). The program partners with our vendors (primary vendors: AMD and Cray/HPE) to ensure that Frontier will be able to perform large-scale science when it opens to users in 2022. To this end, performance engineers on the PIConGPU team wanted to dive deep into the application to understand, at the finest granularity, which portions of the code could be further optimized to exploit the hardware on Summit to its maximum potential, and to identify which key kernels should be tracked and optimized for the CAAR effort to port this code to Frontier. Any bottlenecks observed via performance profiling on Summit are likely to also impact scalability on the Frontier-dev system and the Frontier Early Access (EA) system.
Additionally, the engineers wanted to take a closer look at the newest NVIDIA profiling tools. Doing so allows us to identify the most useful features of these tools, compare them with new performance analysis tool releases from AMD and Cray, and give our vendor partners feedback on which features are most important and mission-critical for CAAR efforts. The primary goal of this report is to evaluate PIConGPU’s most time-intensive kernels using NVProf and the NSight Suite. Three kernels, Current Deposition (also known as Compute Current), Particle Push (Move and Mark), and Shift Particles, are known to be among the most time-consuming kernels in PIConGPU. The Current Deposition and Particle Push kernels both set up the particle attributes for running any physics simulation with PIConGPU, so it is crucial to improve the performance of these two kernels. In this report, we measure single-GPU metrics for the three kernels, offer high-level takeaways from the analysis, and compare the profiling data from NSight Compute with that from NVProf. The analysis was performed using a grid size of 240 x 272 x 224 and 10 time steps with the Mid-November Figure of Merit (FOM) run setup. The Traveling Wave Electron Acceleration (TWEAC) science case used in this run is representative of PIConGPU workloads. This execution can also be used for baseline analysis on AMD MI50/MI60 systems. As of the time of writing, the PIConGPU application has limited use for the features of NSight Systems, so this report focuses mainly on insights from NSight Compute. For this analysis, we run the “full” metric set available in NSight Compute version 2020.1.2 and use NSight Systems version 2020.3.1 to generate the application timeline.
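
The profiling setup described above can be sketched as a small Python helper. This is only an illustration under stated assumptions, not the report's actual job script: the binary name `./picongpu_tweac` and the report file names are hypothetical, and the executable names `ncu` and `nsys` are assumed to be on the PATH (they may differ between tool releases). The sketch computes the cell count implied by the stated grid size and assembles one command line that collects the "full" metric set with NSight Compute and one that records a timeline with NSight Systems.

```python
import shlex

GRID = (240, 272, 224)  # grid size stated in the abstract
TIME_STEPS = 10         # number of time steps stated in the abstract


def cell_count(grid):
    """Return the total number of cells in a 3D grid."""
    nx, ny, nz = grid
    return nx * ny * nz


def ncu_command(app_argv, report="picongpu_ncu"):
    """Build an NSight Compute command that collects the "full" metric set.

    The report name is a hypothetical placeholder.
    """
    return ["ncu", "--set", "full", "-o", report, *app_argv]


def nsys_command(app_argv, report="picongpu_nsys"):
    """Build an NSight Systems command that records an application timeline.

    The report name is a hypothetical placeholder.
    """
    return ["nsys", "profile", "-o", report, *app_argv]


if __name__ == "__main__":
    app = ["./picongpu_tweac"]  # hypothetical binary name, not from the report
    print("cells:", cell_count(GRID), "time steps:", TIME_STEPS)
    print(shlex.join(ncu_command(app)))
    print(shlex.join(nsys_command(app)))
```

The commands are returned as argument lists so they can be passed to `subprocess.run` without shell quoting issues. Any PIConGPU-specific arguments (grid, steps, device layout) would be appended to `app` and are deliberately left out here.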

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1761619

Report Number(s): ORNL/TM-2020/1813

Country of Publication: United States

Language: English

Related Subjects

70 PLASMA PHYSICS AND FUSION TECHNOLOGY
97 MATHEMATICS AND COMPUTING