Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
This article describes the motivation, design, and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling, for instance by reducing idle cycles (by allowing for shutdown without any queue draining period or reallocation of resources to eliminate idle nodes when better-fitting jobs are queued), and by reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.
- Research Organization:
- Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, CA (US)
- Sponsoring Organization:
- USDOE Director. Office of Science. Advanced Scientific Computing Research
- DOE Contract Number:
- AC02-05CH11231
- OSTI ID:
- 926560
- Report Number(s):
- LBNL--60520; BnR: KJ0101030
- Journal Information:
- Journal of Physics: Conference Series, Vol. 46
- Country of Publication:
- United States
- Language:
- English