DOE PAGES
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Heterogeneous computing with OpenMP and Hydra

Abstract

High-performance computing relies on accelerators (such as GPGPUs) to achieve fast execution of scientific applications. Traditionally, these accelerators have been programmed with specialized languages, such as CUDA or OpenCL. In recent years, OpenMP has emerged as a promising alternative for supporting accelerators: it maintains a single code base for the host and different accelerator types, and it provides a simple way to extend accelerator support to existing application codes. Using this support efficiently requires solving several challenges related to performance, work partitioning, and concurrent execution on multiple device types. In this article, we discuss our experiences with using OpenMP for accelerators and present performance guidelines. We also introduce a library, Hydra, that addresses several of the challenges of using OpenMP for such devices. We apply Hydra to a scientific application, PlasCom2, which previously could not use accelerators. Experiments on three architectures show that Hydra results in performance gains of up to 10× compared with CPU-only execution. Concurrent execution on the host and GPU resulted in additional gains of up to 20% compared to running on the GPU alone.
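
The abstract describes offloading loops to accelerators through OpenMP target directives and overlapping host and GPU execution. As a rough, minimal sketch of that idea (not code from the paper, and not Hydra's API; the vector loop, the array names, and the fixed 80/20 work split are illustrative assumptions), the following splits a loop between a deferred GPU target region and host tasks:

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
        double *pa = a.data(), *pb = b.data(), *pc = c.data();
        const int split = static_cast<int>(0.8 * n);  // assumed GPU share of the work

        printf("offload devices available: %d\n", omp_get_num_devices());

        #pragma omp parallel
        #pragma omp single
        {
            // Deferred target task: the first 'split' iterations run on the
            // device while the host continues past this construct.
            #pragma omp target teams distribute parallel for \
                map(to: pa[0:split], pb[0:split]) map(from: pc[0:split]) nowait
            for (int i = 0; i < split; ++i)
                pc[i] = pa[i] + pb[i];

            // Remaining iterations run as host tasks, concurrently with the GPU.
            #pragma omp taskloop
            for (int i = split; i < n; ++i)
                pc[i] = pa[i] + pb[i];

            // The implicit barrier at the end of 'single' waits for both parts.
        }
        printf("c[0] = %.1f, c[n-1] = %.1f\n", pc[0], pc[n - 1]);
        return 0;
    }

With a compiler that supports OpenMP offloading (for example, clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda), the target region executes on the GPU while the taskloop occupies the host threads; without a device, the target region simply falls back to host execution.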

Authors:
Diener, Matthias [1]; Kale, Laxmikant V. [1]; Bodony, Daniel J. [1]
  1. University of Illinois at Urbana-Champaign, Champaign, Illinois, USA
Publication Date:
March 7, 2020
Sponsoring Org.:
USDOE
OSTI Identifier:
1603688
Resource Type:
Publisher's Accepted Manuscript
Journal Name:
Concurrency and Computation: Practice and Experience
Additional Journal Information:
Journal Volume: 32; Journal Issue: 20; Journal ID: ISSN 1532-0626
Publisher:
Wiley Blackwell (John Wiley & Sons)
Country of Publication:
United Kingdom
Language:
English

Citation Formats

Diener, Matthias, Kale, Laxmikant V., and Bodony, Daniel J. Heterogeneous computing with OpenMP and Hydra. United Kingdom: N. p., 2020. Web. doi:10.1002/cpe.5728.
Diener, Matthias, Kale, Laxmikant V., & Bodony, Daniel J. Heterogeneous computing with OpenMP and Hydra. United Kingdom. https://doi.org/10.1002/cpe.5728
Diener, Matthias, Kale, Laxmikant V., and Bodony, Daniel J. 2020. "Heterogeneous computing with OpenMP and Hydra". United Kingdom. https://doi.org/10.1002/cpe.5728.
@article{osti_1603688,
title = {Heterogeneous computing with OpenMP and Hydra},
author = {Diener, Matthias and Kale, Laxmikant V. and Bodony, Daniel J.},
abstractNote = {High-performance computing relies on accelerators (such as GPGPUs) to achieve fast execution of scientific applications. Traditionally, these accelerators have been programmed with specialized languages, such as CUDA or OpenCL. In recent years, OpenMP has emerged as a promising alternative for supporting accelerators: it maintains a single code base for the host and different accelerator types, and it provides a simple way to extend accelerator support to existing application codes. Using this support efficiently requires solving several challenges related to performance, work partitioning, and concurrent execution on multiple device types. In this article, we discuss our experiences with using OpenMP for accelerators and present performance guidelines. We also introduce a library, Hydra, that addresses several of the challenges of using OpenMP for such devices. We apply Hydra to a scientific application, PlasCom2, which previously could not use accelerators. Experiments on three architectures show that Hydra results in performance gains of up to 10× compared with CPU-only execution. Concurrent execution on the host and GPU resulted in additional gains of up to 20% compared to running on the GPU alone.},
doi = {10.1002/cpe.5728},
journal = {Concurrency and Computation: Practice and Experience},
number = 20,
volume = 32,
place = {United Kingdom},
year = {2020},
month = {mar}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.1002/cpe.5728

Citation Metrics:
Cited by: 2 works (citation information provided by Web of Science)


Works referenced in this record:

Self-Adaptive OmpSs Tasks in Heterogeneous Environments
conference, May 2013

  • Planas, Judit; Badia, Rosa M.; Ayguade, Eduard
  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2013.53

Exploring Programming Multi-GPUs Using OpenMP and OpenACC-Based Hybrid Model
conference, May 2013

  • Xu, Rengan; Chandrasekaran, Sunita; Chapman, Barbara
  • 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW)
  • DOI: 10.1109/IPDPSW.2013.263

Efficient Fork-Join on GPUs Through Warp Specialization
conference, December 2017

  • Jacob, Arpith Chacko; Eichenberger, Alexandre E.; Sung, Hyojin
  • 2017 IEEE 24th International Conference on High Performance Computing (HiPC)
  • DOI: 10.1109/HiPC.2017.00048

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
journal, December 2014

  • Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel
  • Journal of Parallel and Distributed Computing, Vol. 74, Issue 12
  • DOI: 10.1016/j.jpdc.2014.07.003

XKaapi: A Runtime System for Data-Flow Task Programming on Heterogeneous Architectures
conference, May 2013

  • Gautier, Thierry; Lima, Joao V. F.; Maillard, Nicolas
  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2013.66

The Spack package manager: bringing order to HPC software chaos
conference, January 2015

  • Gamblin, Todd; LeGendre, Matthew; Collette, Michael R.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15)
  • DOI: 10.1145/2807591.2807623

Improving the memory access locality of hybrid MPI applications
conference, January 2017

  • Diener, Matthias; White, Sam; Kale, Laxmikant V.
  • Proceedings of the 24th European MPI Users' Group Meeting (EuroMPI '17)
  • DOI: 10.1145/3127024.3127038

DawnCC: Automatic Annotation for Data Parallelism and Offloading
journal, May 2017

  • Mendonça, Gleison; Guimarães, Breno; Alves, Péricles
  • ACM Transactions on Architecture and Code Optimization, Vol. 14, Issue 2
  • DOI: 10.1145/3084540

Chai: Collaborative heterogeneous applications for integrated-architectures
conference, April 2017

  • Gómez-Luna, Juan; El Hajj, Izzat; Chang, Li-Wen
  • 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
  • DOI: 10.1109/ISPASS.2017.7975269

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
journal, November 2010

  • Augonnet, Cédric; Thibault, Samuel; Namyst, Raymond
  • Concurrency and Computation: Practice and Experience, Vol. 23, Issue 2
  • DOI: 10.1002/cpe.1631

HPX: A Task Based Programming Model in a Global Address Space
conference, January 2014

  • Kaiser, Hartmut; Heller, Thomas; Adelstein-Lelbach, Bryce
  • Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14
  • DOI: 10.1145/2676870.2676883

Legion: Expressing locality and independence with logical regions
conference, November 2012

  • Bauer, Michael; Treichler, Sean; Slaughter, Elliott
  • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '12)
  • DOI: 10.1109/SC.2012.71

Performance analysis of OpenMP on a GPU using a CORAL proxy application
conference, January 2015

  • Bercea, Gheorghe-Teodor; Appelhans, David; O'Brien, Kevin
  • Proceedings of the 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems - PMBS '15
  • DOI: 10.1145/2832087.2832089

A uniform approach for programming distributed heterogeneous computing systems
journal, December 2014

  • Grasso, Ivan; Pellegrini, Simone; Cosenza, Biagio
  • Journal of Parallel and Distributed Computing, Vol. 74, Issue 12
  • DOI: 10.1016/j.jpdc.2014.08.002

A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters
conference, November 2014

  • Noack, Matthias; Wende, Florian; Steinke, Thomas
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.22

Directive-based Programming Models for Scientific Applications - A Comparison
conference, November 2012

  • Xu, Rengan; Chandrasekaran, Sunita; Chapman, Barbara
  • 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis (SCC)
  • DOI: 10.1109/SCC.2012.6522594

Hetero-mark, a benchmark suite for CPU-GPU collaborative computing
conference, September 2016

  • Sun, Yifan; Gong, Xiang; Ziabari, Amir Kavyan
  • 2016 IEEE International Symposium on Workload Characterization (IISWC)
  • DOI: 10.1109/IISWC.2016.7581262