GPU-Centric Communication on NVIDIA GPU Clusters with InfiniBand: A Case Study with OpenSHMEM

Potluri, Sreeram; Goswami, Anshuman; Rossetti, Davide; Newburn, Chris (CJ); Gorentla Venkata, Manjunath; Imam, Neena

doi:10.1109/HiPC.2017.00037

Conference · Thu Nov 30 23:00:00 EST 2017

DOI: https://doi.org/10.1109/HiPC.2017.00037 · OSTI ID: 1427708

Potluri, Sreeram ^[1]; Goswami, Anshuman ^[1]; Rossetti, Davide ^[1]; Newburn, Chris (CJ) ^[1]; Gorentla Venkata, Manjunath ^[2]; Imam, Neena ^[2]

[1] NVIDIA, Santa Clara, CA
[2] ORNL

GPUs have become an essential component for building compute clusters with high compute density and high performance per watt. As such clusters scale to thousands of GPUs, moving data efficiently between the GPUs becomes imperative for maximum performance. NVSHMEM is an implementation of the OpenSHMEM standard for NVIDIA GPU clusters that allows communication to be issued from inside GPU kernels. In earlier work, we showed how NVSHMEM can be used to achieve better application performance on GPUs connected through PCIe or NVLink. As part of this effort, we implement IB verbs for Mellanox InfiniBand adapters in CUDA. We evaluate different design alternatives, taking into consideration the relaxed memory model, automatic memory access coalescing, and thread hierarchy on the GPU. We also consider correctness issues that arise in these designs. We take advantage of these designs either transparently or through API extensions in NVSHMEM. With micro-benchmarks, we show that an NVIDIA Pascal P100 GPU is able to saturate the network bandwidth using only one or two of its 56 available streaming multiprocessors (SMs). On a single GPU using a single IB EDR adapter, we achieve a throughput of around 90 million messages per second. In addition, we implement a 2D stencil application kernel using NVSHMEM and compare its performance with a CUDA-aware MPI-based implementation that uses GPUDirect RDMA. Speedups in the range of 23% to 42% are seen for input sizes large enough to fill the occupancy of NVIDIA Pascal P100 GPUs on 2 to 4 nodes, indicating that there are gains to be had by eliminating the CPU from the communication path when all computation runs on the GPU.
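
As a rough, back-of-the-envelope reading of the micro-benchmark figures above, the sketch below estimates the payload size at which a rate of about 90 million messages per second would correspond to the full line rate of the link. It assumes a nominal 100 Gb/s data rate for a 4x InfiniBand EDR link and ignores protocol headers. The function names and constants are illustrative and are not taken from the paper.

```python
# Nominal figures. The EDR rate is an assumption (4x link, 100 Gb/s,
# protocol headers ignored); the message rate is the approximate value
# quoted in the abstract.
EDR_LINK_GBPS = 100
MSG_RATE_PER_SEC = 90_000_000


def crossover_message_bytes(link_gbps, msgs_per_sec):
    """Payload size in bytes at which msgs_per_sec messages fill link_gbps."""
    link_bytes_per_sec = link_gbps * 1e9 / 8
    return link_bytes_per_sec / msgs_per_sec


def bandwidth_gbps(msg_bytes, msgs_per_sec):
    """Payload bandwidth in Gb/s for a given message size and message rate."""
    return msg_bytes * 8 * msgs_per_sec / 1e9


size = crossover_message_bytes(EDR_LINK_GBPS, MSG_RATE_PER_SEC)
print(f"crossover payload size: {size:.1f} bytes")
print(f"8-byte messages at that rate: {bandwidth_gbps(8, MSG_RATE_PER_SEC):.2f} Gb/s")
```

Under these assumptions the crossover is roughly 139 bytes: for smaller payloads the message rate, rather than link bandwidth, would be the limiting factor, which is one way to see why the abstract reports message rate and bandwidth as separate results.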

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1427708

Country of Publication: United States

Language: English

Similar Records

Efficient Breadth First Search on Multi-GPU Systems Using GPU-Centric OpenSHMEM

Conference · Mon Aug 07 00:00:00 EDT 2017 · OSTI ID:1567474

Evaluating On-Node GPU Interconnects for Deep Learning Workloads

Conference · Sun Dec 31 23:00:00 EST 2017 · OSTI ID:1525777

Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite

Conference · Sun Sep 30 00:00:00 EDT 2018 · OSTI ID:1511696
