U.S. Department of Energy
Office of Scientific and Technical Information

Sum Reduction with OpenMP Offload on NVIDIA Grace-Hopper System

Conference ·
OSTI ID:2483412

We evaluate the performance of baseline and optimized sum reductions in OpenMP on an NVIDIA Grace-Hopper system. We explore how the number of teams, the number of elements summed per loop iteration, and simultaneous execution on the central processing unit (CPU) and the graphics processing unit (GPU) in unified memory (UM) mode affect reduction performance. The experimental results show that the optimized reductions are 6.120x to 20.906x faster than the baselines on the GPU and achieve 89% to 95% of the theoretical GPU memory bandwidth. When the reduction is co-run on the CPU and GPU in UM mode, the results depend on where the input array is allocated in the program: the average speedup over GPU-only execution is approximately 2.484 or 1.067, and the speedup of the optimized reductions over the baseline reductions ranges from 0.996 to 10.654 or from 0.998 to 6.729.
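The two tuning knobs named in the abstract, the number of teams and the number of elements summed per loop iteration, can be illustrated with a minimal C sketch. This is not the authors' code: the function names `sum_baseline` and `sum_unrolled`, the parameter `teams`, and the constant `ELEMS_PER_ITER` are hypothetical, and the value 4 is an arbitrary choice rather than a setting reported in the paper. CPU and GPU co-execution in UM mode is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical unroll factor: elements summed per loop iteration.
 * The value 4 is an illustrative assumption, not a result from the paper. */
#define ELEMS_PER_ITER 4

/* Baseline: one element per iteration; the OpenMP runtime chooses
 * the number of teams and threads. */
double sum_baseline(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n]) map(tofrom: sum) reduction(+: sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Tuned variant: the caller sets the number of teams, and each loop
 * iteration sums ELEMS_PER_ITER consecutive elements. */
double sum_unrolled(const double *a, size_t n, int teams)
{
    assert(teams > 0);

    double sum = 0.0;
    size_t chunks = n / ELEMS_PER_ITER;  /* full groups of 4 elements */

    #pragma omp target teams distribute parallel for num_teams(teams) \
        map(to: a[0:n]) map(tofrom: sum) reduction(+: sum)
    for (size_t c = 0; c < chunks; c++) {
        size_t base = c * ELEMS_PER_ITER;
        sum += a[base] + a[base + 1] + a[base + 2] + a[base + 3];
    }

    /* Leftover elements (fewer than ELEMS_PER_ITER) are added on the host. */
    for (size_t i = chunks * ELEMS_PER_ITER; i < n; i++)
        sum += a[i];

    return sum;
}
```

Built without OpenMP support, the pragmas are ignored and both functions run serially on the host, which makes it easy to check correctness before timing offloaded runs. The best values for the team count and the per-iteration element count depend on the hardware and compiler and would need to be found by measurement.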

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
2483412
Country of Publication:
United States
Language:
English

Similar Records

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee
Conference · Sun Dec 29 23:00:00 EST 2024 · OSTI ID:2519665

Studying CPU and memory utilization of applications on Fujitsu A64FX and Nvidia Grace Superchip
Conference · Tue Dec 10 23:00:00 EST 2024 · OSTI ID:2496226

OpenMP Target Task: Tasking and Target Offloading on Heterogeneous Systems
Conference · Wed Jun 01 00:00:00 EDT 2022 · OSTI ID:1885285
