Juggler: a dependence-aware task-based execution framework for GPUs
- ORNL
- University of California, Riverside
Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly limit the speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To run applications with inter-block data dependences efficiently, we need fine-grained task-based execution models that treat the SMs inside a GPU as stand-alone parallel processing units. Such a scheme enables faster execution by utilizing all computational elements inside the GPU and eliminating unnecessary waits at device-wide global barriers.

In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host for the duration of the execution. We evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1430605
- Resource Relation:
- Conference: Principles and Practice of Parallel Programming (PPoPP) 2018, Vienna, Austria, February 26-28, 2018
- Country of Publication:
- United States
- Language:
- English