OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Juggler: a dependence-aware task-based execution framework for GPUs

Conference

Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly reduce the speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To run applications with interblock data dependences efficiently, we need fine-grained task-based execution models that treat the SMs inside a GPU as stand-alone parallel processing units. Such a scheme enables faster execution by utilizing all internal computation elements inside the GPU and eliminating unnecessary waits during device-wide global barriers.

In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, thereby eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host at any point during execution. We have evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained a performance improvement of up to 31% over a global-barrier-based implementation, with minimal runtime overhead.
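
To make the kind of input described above concrete, the following is a minimal sketch of a wavefront computation expressed with OpenMP 4.5 task dependences. It is an illustration under stated assumptions, not code from the paper: the grid size `NB`, the helper `tile_work`, and the function `wavefront` are hypothetical names, and the record does not specify which subset of OpenMP tasking Juggler accepts. Each tile depends only on its north and west neighbors through `depend` clauses, so no global barrier separates successive diagonals.

```c
#define NB 4  /* tiles per dimension (hypothetical size chosen for illustration) */

/* Hypothetical per-tile work: one more than the larger of the two inputs.
   With a zero halo, tile (i, j) ends up holding i + j - 1. */
static int tile_work(int north, int west) {
    return (north > west ? north : west) + 1;
}

/* Fill a tile grid in wavefront order using OpenMP 4.5 task dependences.
   Row 0 and column 0 form a halo that the caller must zero beforehand, so
   every depend clause names a valid array element. */
void wavefront(int grid[NB + 1][NB + 1]) {
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 1; i <= NB; i++) {
            for (int j = 1; j <= NB; j++) {
                /* A task becomes ready once its north and west tiles are
                   written; tiles on the same anti-diagonal may run
                   concurrently without any barrier between diagonals. */
                #pragma omp task firstprivate(i, j) shared(grid) \
                        depend(in: grid[i - 1][j], grid[i][j - 1]) \
                        depend(out: grid[i][j])
                grid[i][j] = tile_work(grid[i - 1][j], grid[i][j - 1]);
            }
        }
    } /* the implicit barrier ending the region waits for all tasks */
}
```

To use it, a caller declares `int g[NB + 1][NB + 1] = {{0}};` and calls `wavefront(g)`. Built with an OpenMP-enabled compiler (for example `-fopenmp` with GCC or Clang), the tasks are scheduled according to their dependences; built without OpenMP, the pragmas are ignored and the loops run serially with the same result.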

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1430605
Resource Relation:
Conference: Principles and Practice of Parallel Programming (PPoPP) 2018 - Vienna, Austria - 2/26/2018 10:00:00 AM-2/28/2018 9:00:00 AM
Country of Publication:
United States
Language:
English
