Title: Concepts for OpenMP Target Offload Resilience

Conference

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the general-purpose graphics processing unit (GPGPU) errors and failures experienced on Titan. It further proposes improvements to OpenMP, including a preliminary prototype design, to support resilient offload to devices and efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1570122
Resource Relation:
Journal Volume: 11718; Conference: 15th International Workshop on OpenMP (IWOMP 2019), Auckland, New Zealand, September 11-13, 2019
Country of Publication:
United States
Language:
English
