OSTI.GOV | U.S. Department of Energy, Office of Scientific and Technical Information

Title: Final Report - Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact

Abstract

For HPC systems to date, application resilience to faults and failures has been accomplished by the brute-force method of checkpoint/restart, which allows an application to make forward progress in the face of system and application faults, errors, and failures independent of root cause or end result. It has remained the primary resilience mechanism because we lack a way to identify faults and anticipate consequences early enough to take meaningful mitigating action. However, checkpoint/restart implementations put a tremendous burden on system resources and on the applications themselves, and they are becoming less feasible at scale. Because we have not yet operated at scales at which checkpoint/restart fails to provide forward progress, despite increasing costs, vendors have had little motivation to provide the instrumentation necessary for early identification of faults and failures. However, as we move from petascale to exascale, shrinking component mean time to failure (MTTF) will render the existing techniques ineffectual and/or too expensive. Furthermore, fault recovery mechanisms such as failover and/or error correction introduce performance inconsistency. Instrumentation allowing early indication of problems, together with tools that enable systems, operating systems, and applications to use that information, offers an alternative, more scalable and less costly solution. In the HMDR project, we built on our experience and expertise developed and accumulated over years of research on the design, monitoring, measurement, and assessment of resilient computing systems. Analysis of field data on the current and past generations of extreme-scale systems revealed several challenges that, if not addressed in increasingly larger and more complex systems, may hinder the effectiveness of future exascale computing systems. Specifically, i) file systems and interconnects in current-generation large-scale systems already operate at the margins of resiliency, including consistent performance, and may not scale to larger deployments; ii) automated, software-based failover mechanisms are frequently inadequate and can introduce wider failures, such that failures during recovery may lead to system/application failures, including system-wide outages; and iii) silent data corruption represents a critical fault mode and will require efficient detection mechanisms if next-generation applications are to take full advantage of exascale hardware. To address the above challenges, we assembled a team of world-renowned experts in resilient extreme-scale computing from the University of Illinois (Electrical and Computer Engineering, Computer Science, and NCSA), SNL, LANL, NERSC, and Cray. Our team includes representatives from centers that house many of the largest HPC resources in the world, both today and over the coming years. The team has a unique track record of research in i) system and application failure characterization based on the analysis of field data, ii) data-driven design of fault/error detection mechanisms, and iii) experimental characterization of system/application resiliency. The team includes system owners/operators who provide continuous data collection and access and ensure installation of appropriate analysis tools.
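The abstract's scaling argument, that shrinking component MTTF erodes the viability of checkpoint/restart, can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the report; it uses the well-known Young/Daly first-order approximation of the optimal checkpoint interval, T_opt ~ sqrt(2 * C * MTBF), and the per-node MTTF, node counts, and checkpoint write time it assumes are illustrative values, not HMDR measurements.

# Back-of-the-envelope sketch (not from the report): why checkpoint/restart
# strains at scale. The node counts, per-node MTTF, and checkpoint cost below
# are illustrative assumptions, not measured HMDR figures.
import math

def system_mtbf(node_mtbf_hours, num_nodes):
    # Effective system MTBF assuming independent, exponentially
    # distributed node failures: MTBF_sys = MTBF_node / N.
    return node_mtbf_hours / num_nodes

def young_daly_interval(checkpoint_cost_hours, mtbf_hours):
    # Young/Daly first-order approximation of the optimal
    # checkpoint interval: T_opt ~ sqrt(2 * C * MTBF).
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

def checkpoint_overhead(checkpoint_cost_hours, interval_hours):
    # Fraction of wall-clock time spent writing checkpoints.
    return checkpoint_cost_hours / (interval_hours + checkpoint_cost_hours)

if __name__ == "__main__":
    node_mtbf = 5.0 * 365 * 24      # assume 5 years of MTTF per node, in hours
    checkpoint_cost = 10.0 / 60.0   # assume a 10-minute global checkpoint write
    for nodes in (10_000, 100_000, 1_000_000):
        mtbf = system_mtbf(node_mtbf, nodes)
        t_opt = young_daly_interval(checkpoint_cost, mtbf)
        ovh = checkpoint_overhead(checkpoint_cost, t_opt)
        print(f"{nodes:>9} nodes: system MTBF {mtbf:6.2f} h, "
              f"optimal interval {t_opt:5.2f} h, overhead {ovh:5.1%}")

With these assumed numbers, checkpoint overhead grows from roughly 12% of wall-clock time at 10,000 nodes to more than half at 1,000,000 nodes, where the system MTBF also falls below the checkpoint write time itself; this illustrates the abstract's point that checkpoint/restart alone becomes less feasible at scale.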

Authors:
Kramer, William [1]; Jha, Saurabh [1]; Brandt, James [2]; Gentile, Ann [2]
  1. University of Illinois Urbana-Champaign
  2. Sandia National Laboratories
Publication Date:
April 2020
Research Org.:
Lead PI: William T. Kramer
Institutional PI: James Brandt, Sandia National Laboratories
Institutional PI: James Lujan, Los Alamos National Laboratory
Institutional PI: Nicholas Wright, National Energy Research Scientific Computing Center and Lawrence Berkeley National Laboratory
Institutional PI (unfunded): Larry Kaplan, Cray Inc.
Institutional PI: Ravishankar Iyer, Electrical and Computer Engineering, University of Illinois Urbana-Champaign
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1615150
Report Number(s):
DOE-UIUC-14328
DOE Contract Number:  
SC0014328
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
HPC; Fault Injection; extreme-scale; resilience

Citation Formats

Kramer, William, Jha, Saurabh, Brandt, James, and Gentile, Ann. Final Report - Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. United States: N. p., 2020. Web. doi:10.2172/1615150.
Kramer, William, Jha, Saurabh, Brandt, James, & Gentile, Ann. Final Report - Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. United States. https://doi.org/10.2172/1615150
Kramer, William, Jha, Saurabh, Brandt, James, and Gentile, Ann. 2020. "Final Report - Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact". United States. https://doi.org/10.2172/1615150. https://www.osti.gov/servlets/purl/1615150.
@article{osti_1615150,
title = {Final Report - Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact},
author = {Kramer, William and Jha, Saurabh and Brandt, James and Gentile, Ann},
abstractNote = {For HPC systems to date, application resilience to faults and failures has been accomplished by the brute-force method of checkpoint/restart, which allows an application to make forward progress in the face of system and application faults, errors, and failures independent of root cause or end result. It has remained the primary resilience mechanism because we lack a way to identify faults and anticipate consequences early enough to take meaningful mitigating action. However, checkpoint/restart implementations put a tremendous burden on system resources and on the applications themselves, and they are becoming less feasible at scale. Because we have not yet operated at scales at which checkpoint/restart fails to provide forward progress, despite increasing costs, vendors have had little motivation to provide the instrumentation necessary for early identification of faults and failures. However, as we move from petascale to exascale, shrinking component mean time to failure (MTTF) will render the existing techniques ineffectual and/or too expensive. Furthermore, fault recovery mechanisms such as failover and/or error correction introduce performance inconsistency. Instrumentation allowing early indication of problems, together with tools that enable systems, operating systems, and applications to use that information, offers an alternative, more scalable and less costly solution. In the HMDR project, we built on our experience and expertise developed and accumulated over years of research on the design, monitoring, measurement, and assessment of resilient computing systems. Analysis of field data on the current and past generations of extreme-scale systems revealed several challenges that, if not addressed in increasingly larger and more complex systems, may hinder the effectiveness of future exascale computing systems. Specifically, i) file systems and interconnects in current-generation large-scale systems already operate at the margins of resiliency, including consistent performance, and may not scale to larger deployments; ii) automated, software-based failover mechanisms are frequently inadequate and can introduce wider failures, such that failures during recovery may lead to system/application failures, including system-wide outages; and iii) silent data corruption represents a critical fault mode and will require efficient detection mechanisms if next-generation applications are to take full advantage of exascale hardware. To address the above challenges, we assembled a team of world-renowned experts in resilient extreme-scale computing from the University of Illinois (Electrical and Computer Engineering, Computer Science, and NCSA), SNL, LANL, NERSC, and Cray. Our team includes representatives from centers that house many of the largest HPC resources in the world, both today and over the coming years. The team has a unique track record of research in i) system and application failure characterization based on the analysis of field data, ii) data-driven design of fault/error detection mechanisms, and iii) experimental characterization of system/application resiliency. The team includes system owners/operators who provide continuous data collection and access and ensure installation of appropriate analysis tools.},
doi = {10.2172/1615150},
url = {https://www.osti.gov/biblio/1615150},
place = {United States},
year = {2020},
month = {4}
}