Quantifying effectiveness of failure prediction and response in HPC systems : methodology and example.

Mayo, Jackson R; Chen, Frank Xiaoxiao; Pebay, Philippe Pierre; Wong, Matthew H; Thompson, David; Gentile, Ann C; Roe, Diana C; De Sapio, Vincent; Brandt, James M

Conference · Tue Jun 01 00:00:00 EDT 2010

OSTI ID:1021669

Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors. While many research studies have quantified the expected impact of growing system size, and the associated shortened mean time to failure (MTTF), on application performance in large-scale high-performance computing (HPC) platforms, there has been little if any work to quantify the possible gains from predicting system resource failures with significant but imperfect accuracy. This possibly stems from HPC system complexity and the fact that, to date, no one has established any good predictors of failure in these systems. Our work in the OVIS project aims to discover these predictors via a variety of data collection techniques and statistical analysis methods that yield probabilistic predictions. The question then is, 'How good or useful are these predictions?' We investigate methods for answering this question in a general setting, and illustrate them using a specific failure predictor discovered on a production system at Sandia.
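As a rough illustration of the quantities the abstract refers to, the sketch below computes two common accuracy measures for a failure predictor (precision and recall) and the classic periodic checkpoint interval derived from a nominal MTTF (Young's first-order approximation). This is not the paper's methodology, which is not available in this record; the function names, the choice of precision and recall as accuracy measures, and the sample numbers are all assumptions made for illustration.

```python
import math


def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a failure predictor.

    true_pos:  predicted failures that actually occurred
    false_pos: predicted failures that did not occur (false alarms)
    false_neg: failures that occurred without a prediction (misses)
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def young_interval(checkpoint_cost, mttf):
    """Young's approximation for a regular checkpoint interval.

    checkpoint_cost and mttf must be in the same time unit;
    the result is sqrt(2 * checkpoint_cost * mttf) in that unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mttf)


# Hypothetical numbers, for illustration only.
p, r = precision_recall(true_pos=8, false_pos=2, false_neg=8)
tau = young_interval(checkpoint_cost=2.0, mttf=100.0)
print(f"precision={p:.2f} recall={r:.2f} periodic interval={tau:.1f}")
```

A predictor with high recall could, in principle, let an application checkpoint less often than this periodic baseline, while low precision adds the cost of unnecessary responses; weighing the two is the kind of cost-benefit question the abstract raises.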

OSTI does not have a digital full text copy available.

Research Organization: Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC04-94AL85000

Report Number(s): SAND2010-4169C; TRN: US201117%%263

Resource Relation: Conference: Proposed for presentation at the Workshop on Fault-Tolerance for HPC at Extreme Scale held June 28, 2010 in Chicago, IL.

Country of Publication: United States

Language: English

Similar Records

Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. Final report

Technical Report · Thu Apr 16 00:00:00 EDT 2020 · OSTI ID:1021669

Kramer, William; Jha, Saurabh; Brandt, James; +1 more

Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery

Conference · Thu Mar 01 00:00:00 EST 2018 · OSTI ID:1021669

Ashraf, Rizwan; Hukerikar, Saurabh; Engelmann, Christian

OVIS 2.0 user's guide.

Technical Report · Wed Apr 01 00:00:00 EDT 2009 · OSTI ID:1021669

Mayo, Jackson R; Gentile, Ann C; Brandt, James M; +4 more

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
ACCURACY
FORECASTING
METRICS
MITIGATION
PERFORMANCE
PRODUCTION
