Application-Specific Fault Tolerance via Data Access Characterization
Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.
- Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1036426
- Report Number(s): PNNL-SA-79368; KJ0402000; TRN: US201206%%299
- Resource Relation: Conference: Proceedings of the 17th International European Conference on Parallel and Distributed Computing, (Euro-Par 2011), August 29-September 2, 2011, Bordeaux, France. Lecture Notes in Computer Science, 6853:340-352
- Country of Publication: United States
- Language: English