Title: Simulating failures on large-scale systems.

Conference

Developing fault management mechanisms is a difficult task because of the unpredictable nature of failures. In this paper, we present a fault simulation framework for Blue Gene/P systems implemented as a part of the Cobalt resource manager. The primary goal of this framework is to support system software development. We also present a hardware diagnostic system that we have implemented using this framework.

Research Organization:
Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science (SC); National Science Foundation (NSF); NSF-MRI GRANT
DOE Contract Number:
DE-AC02-06CH11357
OSTI ID:
1001606
Report Number(s):
ANL/MCS/CP-62088; TRN: US201102%%338
Resource Relation:
Conference: 37th International Conference on Parallel Processing (ICPP 2008); Sep. 8, 2008 - Sep. 12, 2008; Portland, OR
Country of Publication:
United States
Language:
ENGLISH