Concepts for OpenMP Target Offload Resilience

Engelmann, Christian; Vallee, Geoffroy; Pophale, Swaroop

doi:10.1007/978-3-030-28596-8_6

Conference · August 1, 2019

DOI: https://doi.org/10.1007/978-3-030-28596-8_6 · OSTI ID: 1570122

Author affiliation: Oak Ridge National Laboratory (ORNL)

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the experienced general-purpose computing graphics processing unit (GPGPU) errors and failures in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1570122

Resource Relation: Journal Volume: 11718; Conference: 15th International Workshop on OpenMP (IWOMP 2019), Auckland, New Zealand, September 11-13, 2019

Country of Publication: United States

Language: English

Similar Records

COMPOFF: A Compiler Cost model using Machine Learning to predict the Cost of OpenMP Offloading

Conference · May 30, 2022

Mishra, Alok; Soto, Carlos X.; Chheda, Smeet; +3 more

OpenMP Target Task: Tasking and Target Offloading on Heterogeneous Systems

Conference · June 1, 2022

Valero Lara, Pedro; Kim, Jungwon; Hernandez Mendoza, Oscar; +1 more

OpenMP 4.5 Validation and Verification Suite for Device Offload

Conference · August 1, 2018

Monsalve Diaz, Jose; Pophale, Swaroop; Hernandez Mendoza, Oscar; +2 more
