Concepts for OpenMP Target Offload Resilience
Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the general-purpose graphics processing unit (GPGPU) errors and failures experienced on Titan. It further proposes improvements to OpenMP, including a preliminary prototype design, to support resilient offload to devices and efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1570122
- Resource Relation:
- Journal Volume: 11718; Conference: 15th International Workshop on OpenMP (IWOMP 2019), Auckland, New Zealand, September 11-13, 2019
- Country of Publication:
- United States
- Language:
- English
Similar Records
OpenMP Target Task: Tasking and Target Offloading on Heterogeneous Systems
OpenMP 4.5 Validation and Verification Suite for Device Offload