Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime

Peterson, Brad; Humphrey, Alan; Sunderland, Dan; Sutherland, James; Saad, Tony; Dasari, Harish; Berzins, Martin

doi:10.1007/s10766-018-0619-1

Journal Article · December 7, 2018 · International Journal of Parallel Programming

DOI: https://doi.org/10.1007/s10766-018-0619-1 · OSTI ID: 1567537

The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime system. Uintah is based on a distributed directed acyclic graph of computational tasks, with a task scheduler that efficiently schedules and executes these tasks on both CPU cores and on-node accelerators. The runtime system identifies task dependencies, creates a task graph prior to the execution of these tasks, automatically generates MPI message tags, and automatically performs halo transfers for simulation variables. Automating halo transfers in a heterogeneous environment poses significant challenges in two cases: when tasks complete within a few milliseconds, because runtime overhead then affects wall-clock execution time, and when simulation variables require large halos spanning most or all of the computational domain, because task dependencies then become expensive to process. These challenges are magnified at production scale, when application developers require each compute node to perform thousands of different halo transfers among thousands of simulation variables. The principal contributions of this work are to (1) identify and address inefficiencies that arise when mapping tasks onto the GPU in the presence of automated halo transfers, (2) implement new schemes to reduce runtime system overhead, (3) minimize application developer involvement with the runtime, and (4) show overhead reduction results from these improvements.
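The dependency-detection step described in the abstract can be illustrated with a minimal sketch. It is not Uintah's implementation or API: the `Box` class, the `halo_regions` function, and the patch layout are hypothetical names introduced here, and the sketch assumes a single-level grid of axis-aligned patches with a uniform halo width. It grows a patch by its halo width and intersects the result with every other patch to find which cells must be transferred from which neighbor.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Index = Tuple[int, int, int]


@dataclass(frozen=True)
class Box:
    """Half-open range of cells [low, high) in 3D index space (hypothetical type)."""
    low: Index
    high: Index

    def grow(self, width: int) -> "Box":
        # Extend the box by `width` cells in every direction.
        return Box(tuple(l - width for l in self.low),
                   tuple(h + width for h in self.high))

    def intersect(self, other: "Box") -> Optional["Box"]:
        # Component-wise overlap; None when the boxes share no cells.
        low = tuple(max(a, b) for a, b in zip(self.low, other.low))
        high = tuple(min(a, b) for a, b in zip(self.high, other.high))
        if any(l >= h for l, h in zip(low, high)):
            return None
        return Box(low, high)

    def num_cells(self) -> int:
        n = 1
        for l, h in zip(self.low, self.high):
            n *= h - l
        return n


def halo_regions(patch_id: int, patches: Dict[int, Box],
                 halo_width: int) -> Dict[int, Box]:
    """Map each neighbor id to the cells `patch_id` needs from that neighbor."""
    needed = patches[patch_id].grow(halo_width)
    regions = {}
    for other_id, other in patches.items():
        if other_id == patch_id:
            continue
        overlap = needed.intersect(other)
        if overlap is not None:
            regions[other_id] = overlap
    return regions
```

For three 4x4x4 patches laid out along x, a halo width of 1 makes patch 0 depend only on a one-cell-thick slab of its immediate neighbor, while a halo width of 8 makes it depend on every cell of every other patch. The second case is the large-halo situation the abstract describes, where the number of dependencies to process grows with the size of the domain.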

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)

Sponsoring Organization: USDOE Office of Science; USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: NA0002375; AC05-00OR22725

OSTI ID: 1567537

Journal Information: International Journal of Parallel Programming, Vol. 47, Issue 5-6; ISSN 0885-7458

Publisher: Springer

Country of Publication: United States

Language: English

Related Subjects

Concurrency
GPU
Halo transfer
Heterogeneous systems
Hybrid parallelism
Optimization
Parallel
Stencil computation
Uintah
