U.S. Department of Energy
Office of Scientific and Technical Information

Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading

Conference
 [1];  [2];  [2];  [3];  [4]
  1. Stony Brook Univ., Stony Brook, NY (United States)
  2. Brookhaven National Lab. (BNL), Upton, NY (United States)
  3. Argonne National Lab. (ANL), Argonne, IL (United States)
  4. Stony Brook Univ., Stony Brook, NY (United States); Brookhaven National Lab. (BNL), Upton, NY (United States)
The latest OpenMP standard offers automatic device offloading capabilities that facilitate GPU programming. Despite this, many challenges remain. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for a unified memory space, in which the CPU and GPU can access each other's memory transparently; data movement is managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, little is known about how this mechanism performs and how programmers should use it. We modified several benchmark codes from the Rodinia benchmark suite to study the behavior of the OpenMP accelerator extensions, and used them to explore the impact of unified memory in an OpenMP context. We also modified the open-source LLVM compiler to allow OpenMP programs to exploit unified memory. Our evaluation reveals that while the performance of unified memory is comparable to that of normal GPU offloading for benchmarks with little data reuse, it suffers significant overhead when GPU memory is oversubscribed for benchmarks with large amounts of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.
Research Organization:
Brookhaven National Laboratory (BNL), Upton, NY (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (SC-21)
DOE Contract Number:
SC0012704
OSTI ID:
1412779
Report Number(s):
BNL--114801-2017-JA
Country of Publication:
United States
Language:
English
