Title: Towards New Metrics for High-Performance Computing Resilience

Abstract

Ensuring the reliability of applications is becoming an increasingly important challenge as high-performance computing (HPC) systems experience an ever-growing number of faults, errors and failures. While the HPC community has made substantial progress in developing various resilience solutions, it continues to rely on platform-based metrics to quantify application resiliency improvements. The resilience of an HPC application is concerned with the reliability of the application outcome as well as the fault handling efficiency. To understand the scope of impact, effective coverage and performance efficiency of existing and emerging resilience solutions, there is a need for new metrics. In this paper, we develop new ways to quantify resilience that consider both the reliability and the performance characteristics of the solutions from the perspective of HPC applications. As HPC systems continue to evolve in terms of scale and complexity, it is expected that applications will experience various types of faults, errors and failures, which will require applications to apply multiple resilience solutions across the system stack. The proposed metrics are intended to be useful for understanding the combined impact of these solutions on an application's ability to produce correct results and to evaluate their overall impact on an application's performance in the presence of various modes of faults.

Authors:
 Hukerikar, Saurabh [1];  Ashraf, Rizwan A [1];  Engelmann, Christian [1]
  1. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1360079
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), International Symposium on High Performance Parallel and Distributed Computing 2017, Washington, DC, USA, June 26-30, 2017
Country of Publication:
United States
Language:
English
Subject:
High Performance Computing; Resilience; Fault Tolerance; Metrics

Citation Formats

Hukerikar, Saurabh, Ashraf, Rizwan A, and Engelmann, Christian. Towards New Metrics for High-Performance Computing Resilience. United States: N. p., 2017. Web.
Hukerikar, Saurabh, Ashraf, Rizwan A, & Engelmann, Christian. Towards New Metrics for High-Performance Computing Resilience. United States.
Hukerikar, Saurabh, Ashraf, Rizwan A, and Engelmann, Christian. 2017. "Towards New Metrics for High-Performance Computing Resilience". United States.
@article{osti_1360079,
title = {Towards New Metrics for High-Performance Computing Resilience},
author = {Hukerikar, Saurabh and Ashraf, Rizwan A and Engelmann, Christian},
place = {United States},
year = {2017}
}

