OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. Final report

Technical Report · DOI: https://doi.org/10.2172/1615150 · OSTI ID: 1615150
 [1];  [1];  [2];  [2]
  1. Univ. of Illinois at Urbana-Champaign, IL (United States)
  2. Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)

For HPC systems to date, application resilience to faults and failures has been accomplished by the brute-force method of checkpoint/restart, which allows an application to make forward progress in the face of system and application faults, errors, and failures, independent of root cause or end result. It has remained the primary resilience mechanism because we lack a way to identify faults and anticipate their consequences early enough to take meaningful mitigating action. However, checkpoint/restart implementations put a tremendous burden on system resources and on the applications themselves, and they are becoming less feasible at scale. Because checkpoint/restart, despite its increasing costs, has not yet failed to provide forward progress at the scales operated to date, vendors have had little motivation to provide the instrumentation necessary for early identification of faults and failures. However, as we move from petascale to exascale, component mean time to failure (MTTF) will render the existing techniques ineffectual and/or too expensive. Furthermore, fault recovery mechanisms such as failover and/or error correction introduce performance inconsistency. Instrumentation that gives early indication of problems, together with tools that let systems, operating systems, and applications use that information, offers an alternative that is more scalable and less costly. In the Holistic Measurement Driven Resilience (HMDR) project, we built on the experience and expertise accumulated over years of research on the design, monitoring, measurement, and assessment of resilient computing systems. Analysis of field data on current and past generations of extreme-scale systems revealed several challenges that, if not addressed in increasingly larger and more complex systems, may hinder the effectiveness of future exascale computing systems.
Specifically, i) file systems and interconnects in current-generation large-scale systems already operate at the margins of resiliency, including with respect to consistent performance, and may not scale to larger deployments; ii) automated, software-based failover mechanisms are frequently inadequate and can introduce wider failures, such that failures during recovery may lead to system/application failures, including system-wide outages; and iii) silent data corruption represents a critical fault mode and will require efficient detection mechanisms if next-generation applications are to take full advantage of exascale hardware. To address these challenges, we assembled a team of world-renowned experts in resilient extreme-scale computing from the University of Illinois (Electrical and Computer Engineering, Computer Science, and NCSA), SNL, LANL, NERSC, and Cray. Our team includes representatives from centers that house many of the largest HPC resources in the world, both today and over the coming years. The team has a unique track record of research in i) system and application failure characterization based on the analysis of field data, ii) data-driven design of fault/error detection mechanisms, and iii) experimental characterization of system/application resiliency. The team also includes system owners/operators, who provide continuous data collection and access and ensure installation of appropriate analysis tools.

Research Organization: Univ. of Illinois at Urbana-Champaign, IL (United States)
Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
DOE Contract Number: SC0014328
OSTI ID: 1615150
Report Number(s): DOE-UIUC-14328
Country of Publication: United States
Language: English