U.S. Department of Energy
Office of Scientific and Technical Information

Quantifying effectiveness of failure prediction and response in HPC systems: methodology and example.

Conference
OSTI ID:1021669

Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors. While many research studies have quantified the expected impact of growing system size, and the associated shortened mean time to failure (MTTF), on application performance in large-scale high-performance computing (HPC) platforms, there has been little if any work to quantify the possible gains from predicting system resource failures with significant but imperfect accuracy. This possibly stems from HPC system complexity and the fact that, to date, no one has established any good predictors of failure in these systems. Our work in the OVIS project aims to discover these predictors via a variety of data collection techniques and statistical analysis methods that yield probabilistic predictions. The question then is, 'How good or useful are these predictions?' We investigate methods for answering this question in a general setting, and illustrate them using a specific failure predictor discovered on a production system at Sandia.
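As a rough illustration of the kind of accuracy and cost-benefit question the abstract raises, the sketch below compares periodic checkpointing against prediction-assisted checkpointing. It is not the paper's methodology. It assumes a standard first-order waste model (Young's approximation for the checkpoint interval, valid only when the checkpoint cost is much smaller than the MTTF), assumes every alarm triggers one proactive checkpoint with enough lead time, and ignores restart and downtime costs. All function names, parameters, and the numbers in the usage example are hypothetical.

```python
import math


def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall of a failure predictor from confusion counts."""
    alarms = true_pos + false_pos        # every alarm the predictor raised
    failures = true_pos + false_neg      # every failure that occurred
    precision = true_pos / alarms if alarms else 0.0
    recall = true_pos / failures if failures else 0.0
    return precision, recall


def periodic_waste(ckpt_cost, mttf):
    """Fraction of time lost with periodic checkpointing only.

    Assumed first-order model: waste = C/tau + tau/(2*MTTF), evaluated at
    Young's interval tau = sqrt(2*C*MTTF).
    """
    tau = math.sqrt(2.0 * ckpt_cost * mttf)
    return ckpt_cost / tau + tau / (2.0 * mttf)


def predicted_waste(ckpt_cost, mttf, precision, recall):
    """Fraction of time lost when alarms trigger proactive checkpoints.

    Assumptions (hypothetical model, not taken from the paper):
    - predicted failures lose no work beyond one proactive checkpoint;
    - false alarms also cost one checkpoint each;
    - unpredicted failures arrive with mean time MTTF / (1 - recall).
    """
    if not 0.0 < precision <= 1.0 or not 0.0 <= recall <= 1.0:
        raise ValueError("need 0 < precision <= 1 and 0 <= recall <= 1")
    # Alarm rate is recall / (precision * MTTF); each alarm costs ckpt_cost.
    alarm_cost = ckpt_cost * recall / (precision * mttf)
    if recall == 1.0:
        return alarm_cost                # no unpredicted failures remain
    unpredicted_mttf = mttf / (1.0 - recall)
    return periodic_waste(ckpt_cost, unpredicted_mttf) + alarm_cost


if __name__ == "__main__":
    # Hypothetical inputs: 6-minute checkpoints, 24-hour MTTF, and a
    # predictor with 60 hits, 40 false alarms, and 40 missed failures.
    p, r = precision_recall(true_pos=60, false_pos=40, false_neg=40)
    print("precision, recall:", p, r)
    print("periodic waste:", periodic_waste(0.1, 24.0))
    print("prediction-assisted waste:", predicted_waste(0.1, 24.0, p, r))
```

Under this model, a predictor pays off only when the checkpoints avoided by catching failures outweigh the checkpoints spent on false alarms, which is why both precision and recall enter the cost term.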

Research Organization:
Sandia National Laboratories
Sponsoring Organization:
USDOE
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1021669
Report Number(s):
SAND2010-4169C
Country of Publication:
United States
Language:
English

Similar Records

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery
Conference · Wed Feb 28 23:00:00 EST 2018 · OSTI ID:1454399

OVIS 2.0 user's guide.
Technical Report · Wed Apr 01 00:00:00 EDT 2009 · OSTI ID:1028957

OVIS 3.2 user's guide.
Technical Report · Fri Oct 01 00:00:00 EDT 2010 · OSTI ID:1010855