OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Application-Specific Fault Tolerance via Data Access Characterization

Conference

Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1036426
Report Number(s):
PNNL-SA-79368; KJ0402000; TRN: US201206%%299
Resource Relation:
Conference: Proceedings of the 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011), August 29-September 2, 2011, Bordeaux, France. Lecture Notes in Computer Science, 6853:340-352
Country of Publication:
United States
Language:
English