Keeping checkpoint/restart viable for exascale systems.
Description/Abstract
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
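To make the idea of hash-based incremental checkpointing concrete, the following is a minimal CPU-only sketch and not the report's GPU implementation. It assumes application state is a flat byte buffer split into fixed-size blocks; the block size, the use of SHA-256, and the function names `block_hashes` and `incremental_checkpoint` are all hypothetical choices made for illustration. Only blocks whose hash differs from the previous checkpoint are saved, which is what can shrink the commit time. The scheme is probabilistic because a hash collision could, with very small probability, cause a changed block to be skipped.

```python
import hashlib

# Hypothetical block granularity; the report's actual block size is not given here.
BLOCK_SIZE = 4096


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Hash each fixed-size block of the state buffer."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(
    data: bytes,
    prev_hashes: list[bytes],
    block_size: int = BLOCK_SIZE,
) -> tuple[dict[int, bytes], list[bytes]]:
    """Return the blocks that changed since the last checkpoint.

    The result is (changed, new_hashes): `changed` maps a block index to
    its current contents, and `new_hashes` is passed back in as
    `prev_hashes` on the next call. An empty `prev_hashes` yields a full
    checkpoint, since every block counts as changed.
    """
    new_hashes = block_hashes(data, block_size)
    changed: dict[int, bytes] = {}
    for idx, digest in enumerate(new_hashes):
        # A block is saved if it is new or if its hash differs.
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = data[start:start + block_size]
    return changed, new_hashes
```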
| DOI | 10.2172/1029780 |
|---|---|
| Creator/Author: | Riesen, Rolf E. ; Bridges, Patrick G. (IBM Research, Ireland, Mulhuddart, Dublin) ; Stearley, Jon R. ; Laros, James H., III ; Oldfield, Ron A. ; Arnold, Dorian (University of New Mexico, Albuquerque, NM) ; Pedretti, Kevin Thomas Tauke ; Ferreira, Kurt Brian ; Brightwell, Ronald Brian |
| Publication Date: | 2011 Sep 01 |
| OSTI Identifier: | OSTI ID: 1029780 |
| Report Number(s): | SAND2011-6815 |
| DOE Contract Number: | AC04-94AL85000 |
| Other Number(s): | TRN: US201201%%190 |
| Resource Type: | Technical Report |
| Research Org: | Sandia National Laboratories |
| Sponsoring Org: | USDOE |
| Subject: | 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; PROCESSING; RELIABILITY; SIMULATION; TOLERANCE |
| Country of Publication: | United States |
| Language: | English |
| Format: | Size: 112 p. |
| Update Date: | 2012 Jan 26 |
