DOE PAGES | U.S. Department of Energy
Office of Scientific and Technical Information

Title: Methods for multitasking among real-time embedded compute tasks running on the GPU: Methods for Multitasking Real-time Embedded GPU Computing Tasks

Abstract

Here, we provide an extensive survey of a wide spectrum of scheduling methods for multitasking among graphics processing unit (GPU) computing tasks. We then design several schedulers and explain in detail the selected methods we have developed to implement our scheduling strategies. Next, we compare the performance of schedulers on various workloads running on Fermi and Kepler architectures and arrive at the following major conclusions: (1) Small kernels benefit from running kernels concurrently. (2) The combination of small kernels, high-priority kernels with longer runtimes, and lower-priority kernels with shorter runtimes benefits from a CPU scheduler that dynamically changes kernel order on the Fermi architecture. (3) Because of limitations of existing GPU architectures, CPU schedulers currently outperform their GPU counterparts. We also provide results and observations obtained from implementing and evaluating our schedulers on the NVIDIA Jetson TX1 system-on-chip architecture. We observe that although the TX1 has the newer Maxwell architecture, the mechanism used for scheduler timings behaves differently on the TX1 than on Kepler, leading to incorrect timings. In this paper, we describe our methods that allow us to report correct timings for CPU schedulers running on the TX1. Lastly, we propose new research directions involving the investigation of additional scheduling strategies.
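
The abstract's first conclusion, that small kernels benefit from running concurrently, and its mention of scheduler timing mechanisms can be illustrated with a short CUDA sketch. The code below is a minimal, hypothetical example (not code from the paper): it launches several small kernels on separate CUDA streams so the device may overlap them, and times the batch with CUDA events. The kernel body, task count, and problem size are placeholder assumptions.

// Illustrative sketch only: concurrent small-kernel launches via CUDA streams
// plus CUDA-event timing. Kernel, sizes, and task count are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;   // trivial per-element work
}

int main()
{
    const int nTasks = 4;          // hypothetical number of embedded tasks
    const int n = 1 << 16;         // small problem size per task
    float *buf[nTasks];
    cudaStream_t stream[nTasks];
    cudaEvent_t start, stop;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int t = 0; t < nTasks; ++t) {
        cudaMalloc(&buf[t], n * sizeof(float));
        cudaStreamCreate(&stream[t]);
    }

    // Issue each task's kernel on its own stream; small kernels launched this
    // way may execute concurrently on Fermi/Kepler-class hardware.
    cudaEventRecord(start);
    for (int t = 0; t < nTasks; ++t)
        smallKernel<<<(n + 255) / 256, 256, 0, stream[t]>>>(buf[t], n);

    cudaDeviceSynchronize();       // wait for all streams to finish
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time for the batch
    printf("elapsed: %.3f ms\n", ms);

    for (int t = 0; t < nTasks; ++t) {
        cudaStreamDestroy(stream[t]);
        cudaFree(buf[t]);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}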

Authors:
 Muyan-Özçelik, Pınar [1]; Owens, John D. [2]
  1. California State Univ., Sacramento, CA (United States)
  2. Univ. of California, Davis, CA (United States)
Publication Date:
2017-06-05
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1528898
Grant/Contract Number:  
AC02-05CH11231
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation: Practice and Experience
Additional Journal Information:
Journal Volume: 29; Journal Issue: 15; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; GPU computing; multitasking; real-time embedded tasks

Citation Formats

Muyan-Özçelik, Pınar, and Owens, John D. Methods for multitasking among real-time embedded compute tasks running on the GPU: Methods for Multitasking Real-time Embedded GPU Computing Tasks. United States: N. p., 2017. Web. doi:10.1002/cpe.4118.
Muyan-Özçelik, Pınar, & Owens, John D. Methods for multitasking among real-time embedded compute tasks running on the GPU: Methods for Multitasking Real-time Embedded GPU Computing Tasks. United States. https://doi.org/10.1002/cpe.4118
Muyan-Özçelik, Pınar, and Owens, John D. 2017. "Methods for multitasking among real-time embedded compute tasks running on the GPU: Methods for Multitasking Real-time Embedded GPU Computing Tasks". United States. https://doi.org/10.1002/cpe.4118. https://www.osti.gov/servlets/purl/1528898.
@article{osti_1528898,
title = {Methods for multitasking among real-time embedded compute tasks running on the GPU: Methods for Multitasking Real-time Embedded GPU Computing Tasks},
author = {Muyan-Özçelik, Pınar and Owens, John D.},
abstractNote = {Here, we provide an extensive survey of a wide spectrum of scheduling methods for multitasking among graphics processing unit (GPU) computing tasks. We then design several schedulers and explain in detail the selected methods we have developed to implement our scheduling strategies. Next, we compare the performance of schedulers on various workloads running on Fermi and Kepler architectures and arrive at the following major conclusions: (1) Small kernels benefit from running kernels concurrently. (2) The combination of small kernels, high-priority kernels with longer runtimes, and lower-priority kernels with shorter runtimes benefits from a CPU scheduler that dynamically changes kernel order on the Fermi architecture. (3) Because of limitations of existing GPU architectures, CPU schedulers currently outperform their GPU counterparts. We also provide results and observations obtained from implementing and evaluating our schedulers on the NVIDIA Jetson TX1 system-on-chip architecture. We observe that although the TX1 has the newer Maxwell architecture, the mechanism used for scheduler timings behaves differently on the TX1 than on Kepler, leading to incorrect timings. In this paper, we describe our methods that allow us to report correct timings for CPU schedulers running on TX1. Lastly, we propose new research directions involving the investigation of additional scheduling strategies.},
doi = {10.1002/cpe.4118},
journal = {Concurrency and Computation: Practice and Experience},
number = {15},
volume = {29},
place = {United States},
year = {2017},
month = {jun}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 1 work
Citation information provided by Web of Science


Works referenced in this record:

Softshell: dynamic scheduling on GPUs
journal, November 2012

  • Steinberger, Markus; Kainz, Bernhard; Kerbl, Bernhard
  • ACM Transactions on Graphics, Vol. 31, Issue 6
  • DOI: 10.1145/2366145.2366180

The synchronous languages 12 years later
journal, January 2003


The ESTEREL language
journal, January 1991

  • Boussinot, F.; de Simone, R.
  • Proceedings of the IEEE, Vol. 79, Issue 9
  • DOI: 10.1109/5.97299

OptiX: a general purpose ray tracing engine
journal, July 2010

  • Parker, Steven G.; Robison, Austin; Stich, Martin
  • ACM Transactions on Graphics, Vol. 29, Issue 4
  • DOI: 10.1145/1778765.1778803

GRAMPS: A programming model for graphics pipelines
journal, January 2009

  • Sugerman, Jeremy; Fatahalian, Kayvon; Boulos, Solomon
  • ACM Transactions on Graphics, Vol. 28, Issue 1
  • DOI: 10.1145/1477926.1477930

The synchronous data flow programming language LUSTRE
journal, January 1991

  • Halbwachs, N.; Caspi, P.; Raymond, P.
  • Proceedings of the IEEE, Vol. 79, Issue 9
  • DOI: 10.1109/5.97300

Programming real-time applications with SIGNAL
journal, January 1991

  • LeGuernic, P.; Gautier, T.; Le Borgne, M.
  • Proceedings of the IEEE, Vol. 79, Issue 9
  • DOI: 10.1109/5.97301

Out-of-core Data Management for Path Tracing on Hybrid Resources
journal, April 2009


Multitasking Real-time Embedded GPU Computing Tasks
conference, January 2016

  • Muyan-Özçelik, Pınar; Owens, John D.
  • Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM'16
  • DOI: 10.1145/2883404.2883408

Fragment-Parallel Composite and Filter
journal, June 2010


Cooperative Multitasking for GPU-Accelerated Grid Systems
conference, May 2010

  • Ino, Fumihiko; Ogita, Akihiro; Oita, Kentaro
  • 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
  • DOI: 10.1109/CCGRID.2010.18

Efficiently Using a CUDA-enabled GPU as Shared Resource
conference, June 2010

  • Peters, Hagen; Köper, Martin; Luttenberger, Norbert
  • 2010 IEEE 10th International Conference on Computer and Information Technology (CIT)
  • DOI: 10.1109/CIT.2010.204

Understanding the efficiency of ray traversal on GPUs
conference, January 2009

  • Aila, Timo; Laine, Samuli
  • Proceedings of the 1st ACM conference on High Performance Graphics - HPG '09
  • DOI: 10.1145/1572769.1572792

Message passing on data-parallel architectures
conference, May 2009

  • Stuart, Jeff A.; Owens, John D.
  • 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2009.5161065

Portable and transparent software managed scheduling on accelerators for fair resource sharing
conference, January 2016

  • Margiolas, Christos; O'Boyle, Michael F. P.
  • Proceedings of the 2016 International Symposium on Code Generation and Optimization - CGO 2016
  • DOI: 10.1145/2854038.2854040

PTask: operating system abstractions to manage GPUs as compute devices
conference, January 2011

  • Rossbach, Christopher J.; Currey, Jon; Silberstein, Mark
  • Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles - SOSP '11
  • DOI: 10.1145/2043556.2043579

Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing
conference, March 2016

  • Wang, Zhenning; Yang, Jun; Melhem, Rami
  • 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
  • DOI: 10.1109/HPCA.2016.7446078

Analyzing CUDA workloads using a detailed GPU simulator
conference, April 2009

  • Bakhoda, Ali; Yuan, George L.; Fung, Wilson W. L.
  • 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
  • DOI: 10.1109/ISPASS.2009.4919648