RDPM: An Extensible Tool for Resilience Design Patterns Modelling
Resilience to faults, errors, and failures in extreme-scale high-performance computing (HPC) systems is a critical challenge. Resilience design patterns offer a new, structured hardware and software design approach for improving resilience. Prior work focused on developing performance, reliability, and availability models for resilience design patterns. This paper extends that work with a Resilience Design Patterns Modeling (RDPM) tool that (1) explores the performance, reliability, and availability of each resilience design pattern, (2) supports customization of parameters to optimize performance, reliability, and availability, and (3) enables investigation of trade-off models for combining multiple patterns into practical resilience solutions.
- Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1872868
- Resource Relation: Journal Volume: 13098; Conference: 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Lisbon, Portugal, 30 August to 3 September 2021
- Country of Publication: United States
- Language: English