Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

Hukerikar, Saurabh; Engelmann, Christian

doi:10.2172/1338552

Technical Report · October 1, 2016

DOI: https://doi.org/10.2172/1338552 · OSTI ID: 1338552

Hukerikar, Saurabh [1]; Engelmann, Christian [1]

[1] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, with outcomes ranging from result corruption to catastrophic application crashes. Practical limits on power consumption in HPC systems will require future systems to embrace innovative architectures, increasing the levels of hardware and software complexity. The resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. These techniques must seek to improve resilience at reasonable overheads to power consumption and performance. While the HPC community has developed various solutions, both application-level and system-based, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance and power efficiency across the system stack. Additionally, few of the current approaches are portable to the newer architectures and software ecosystems that are expected to be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors, and failures in HPC systems. The catalog of resilience design patterns provides designers with reusable design elements.
We define a design framework that enhances our understanding of the important constraints and opportunities for solutions deployed at various layers of the system stack. The framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1338552

Report Number(s): ORNL/TM-2016/687; KJ0402000; ERKJ300

Country of Publication: United States

Language: English

Similar Records

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)

Technical Report · December 16, 2022 · OSTI ID: 1338552

Engelmann, Christian; Ashraf, Rizwan; Hukerikar, Saurabh; +2 more

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)

Technical Report · December 1, 2016 · OSTI ID: 1338552

Hukerikar, Saurabh; Engelmann, Christian

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

Journal Article · September 1, 2017 · Supercomputing Frontiers and Innovations · OSTI ID: 1338552

Engelmann, Christian; Hukerikar, Saurabh

Related Subjects

97 MATHEMATICS AND COMPUTING
resilience
design patterns
high-performance computing
exascale computing
fault tolerance
