Simulating failures on large-scale systems.
Developing fault management mechanisms is a difficult task because of the unpredictable nature of failures. In this paper, we present a fault simulation framework for Blue Gene/P systems implemented as a part of the Cobalt resource manager. The primary goal of this framework is to support system software development. We also present a hardware diagnostic system that we have implemented using this framework.
- Research Organization: Argonne National Lab. (ANL), Argonne, IL (United States)
- Sponsoring Organization: USDOE Office of Science (SC); National Science Foundation (NSF); NSF-MRI GRANT
- DOE Contract Number: DE-AC02-06CH11357
- OSTI ID: 1001606
- Report Number(s): ANL/MCS/CP-62088; TRN: US201102%%338
- Resource Relation: Conference: 37th International Conference on Parallel Processing (ICPP 2008); Sep. 8, 2008 - Sep. 12, 2008; Portland, OR
- Country of Publication: United States
- Language: English
Similar Records
Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)
Technical Report · Dec. 16, 2022 · OSTI ID: 1001606
Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.2)
Technical Report · Aug. 1, 2017 · OSTI ID: 1001606
Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)
Technical Report · Dec. 1, 2016 · OSTI ID: 1001606