Juggler: a dependence-aware task-based execution framework for GPUs
- ORNL
- University of California, Riverside
Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, data dependences across thread blocks may significantly limit the speedup by requiring global synchronization across the streaming multiprocessors (SMs) inside the GPU. To run applications with inter-block data dependences efficiently, we need fine-grained task-based execution models that treat the SMs inside a GPU as stand-alone parallel processing units. Such a scheme enables faster execution by utilizing all computational elements inside the GPU and eliminating unnecessary waits at device-wide global barriers.

In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host for the duration of the execution. We evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC)
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1430605
- Resource Relation:
- Conference: Principles and Practice of Parallel Programming (PPoPP) 2018, Vienna, Austria, February 26-28, 2018
- Country of Publication:
- United States
- Language:
- English