Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
This article describes the motivation, design, and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time- and space-efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling, for instance by reducing idle cycles (by allowing for shutdown without any queue draining period, or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued) and by reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE Director. Office of Science. Advanced Scientific Computing Research
- DOE Contract Number:
- DE-AC02-05CH11231
- OSTI ID:
- 926560
- Report Number(s):
- LBNL-60520; R&D Project: KS3210; BnR: KJ0101030
- Journal Information:
- Journal of Physics: Conference Series, Vol. 46; Related Information: Journal Publication Date: 2006
- Country of Publication:
- United States
- Language:
- English
The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing (journal, November 2005)
Similar Records
Berkeley Lab Checkpoint/Restart for Linux
The design and implementation of Berkeley Lab's Linux checkpoint/restart