skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Towards New Metrics for High-Performance Computing Resilience

Abstract

Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presencemore » of various modes of faults.« less

Authors:
 [1];  [1];  [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1360079
DOE Contract Number:
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), International Symposium on High Performance Parallel and Distributed Computing 2017, Washington, DC, USA, 20170626, 20170630
Country of Publication:
United States
Language:
English
Subject:
High Performance Computing; Resilience; Fault Tolerance; Metrics

Citation Formats

Hukerikar, Saurabh, Ashraf, Rizwan A, and Engelmann, Christian. Towards New Metrics for High-Performance Computing Resilience. United States: N. p., 2017. Web.
Hukerikar, Saurabh, Ashraf, Rizwan A, & Engelmann, Christian. Towards New Metrics for High-Performance Computing Resilience. United States.
Hukerikar, Saurabh, Ashraf, Rizwan A, and Engelmann, Christian. Sun . "Towards New Metrics for High-Performance Computing Resilience". United States. doi:.
@article{osti_1360079,
title = {Towards New Metrics for High-Performance Computing Resilience},
author = {Hukerikar, Saurabh and Ashraf, Rizwan A and Engelmann, Christian},
abstractNote = {Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presence of various modes of faults.},
doi = {},
journal = {},
number = ,
volume = ,
place = {United States},
year = {Sun Jan 01 00:00:00 EST 2017},
month = {Sun Jan 01 00:00:00 EST 2017}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Save / Share:
  • xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
  • Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between energy efficiency and resilience for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy saving undervolting approach that leverages the mainstream resilience techniques to tolerate the increased failures caused by undervolting.
  • Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between energy efficiency and resilience for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy saving undervolting approach that leverages the mainstream resilience techniques to tolerate the increased failures caused by undervolting.
  • During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University focused on efficient redundancy strategies for head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, like job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes ourmore » accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment.« less
  • General purpose languages, such as C++, permit the construction of various high level abstractions to hide redundant, low level details and accelerate programming productivity. Example abstractions include functions, data structures, classes, templates and so on. However, the use of abstractions significantly impedes static code analyses and optimizations, including parallelization, applied to the abstractions complex implementations. As a result, there is a common perception that performance is inversely proportional to the level of abstraction. On the other hand, programming large scale, possibly heterogeneous high-performance computing systems is notoriously difficult and programmers are less likely to abandon the help from high levelmore » abstractions when solving real-world, complex problems. Therefore, the need for programming models balancing both programming productivity and execution performance has reached a new level of criticality. We are exploring a novel abstraction-friendly programming model in order to support high productivity and high performance computing. We believe that standard or domain-specific semantics associated with high level abstractions can be exploited to aid compiler analyses and optimizations, thus helping achieving high performance without losing high productivity. We encode representative abstractions and their useful semantics into an abstraction specification file. In the meantime, an accessible, source-to-source compiler infrastructure (the ROSE compiler) is used to facilitate recognizing high level abstractions and utilizing their semantics for more optimization opportunities. Our initial work has shown that recognizing abstractions and knowing their semantics within a compiler can dramatically extend the applicability of existing optimizations, including automatic parallelization. Moreover, a new set of optimizations have become possible within an abstraction-friendly and semantics-aware programming model. In the future, we will apply our programming model to more large scale applications. In particular, we plan to classify and formalize more high level abstractions and semantics which are relevant to high performance computing. We will also investigate better ways to allow language designers, library developers and programmers to communicate abstraction and semantics information with each other.« less