skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)

Abstract

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore the resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in HPC systems future systems are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. As a result the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency across themore » system stack. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems. The complete catalog of resilience design patterns provides designers with reusable design elements. We also define a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also supports optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.« less

Authors:
 [1];  [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1345793
Report Number(s):
ORNL/TM-2016/767
KJ0402000; ERKJ300
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; resilience; design patterns; high-performance computing; exascale computing; fault tolerance

Citation Formats

Hukerikar, Saurabh, and Engelmann, Christian. Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1). United States: N. p., 2016. Web. doi:10.2172/1345793.
Hukerikar, Saurabh, & Engelmann, Christian. Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1). United States. doi:10.2172/1345793.
Hukerikar, Saurabh, and Engelmann, Christian. Thu . "Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)". United States. doi:10.2172/1345793. https://www.osti.gov/servlets/purl/1345793.
@article{osti_1345793,
title = {Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)},
author = {Hukerikar, Saurabh and Engelmann, Christian},
abstractNote = {Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore the resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in HPC systems future systems are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. As a result the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency across the system stack. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems. The complete catalog of resilience design patterns provides designers with reusable design elements. We also define a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also supports optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.},
doi = {10.2172/1345793},
journal = {},
number = ,
volume = ,
place = {United States},
year = {Thu Dec 01 00:00:00 EST 2016},
month = {Thu Dec 01 00:00:00 EST 2016}
}

Technical Report:

Save / Share: