Berkeley Lab Checkpoint/Restart for Linux
This package implements system-level checkpointing of scientific applications running on Linux clusters in a manner suitable for implementing preemption, migration, and fault recovery by a batch scheduler. The design includes documented interfaces for a cooperating application or library to implement extensions to the checkpoint system, such as consistent checkpointing of distributed MPI applications. Using this package with an appropriate MPI implementation, the vast majority of scientific applications which use MPI for communication are checkpointable without any modifications to the application source code.
Extending the VMAdump code used in the bproc system, the BLCR kernel modules provide three additional features necessary for useful system-level checkpointing of scientific applications (installation of bproc is not required to use BLCR). First, this package provides the bookkeeping and coordination required for checkpointing and restoring multi-threaded and multi-process applications running on a single node. Second, this package provides a system call interface allowing checkpoints to be requested by any authorized process, such as a batch scheduler. Third, this package provides a system call interface allowing applications and/or application libraries to extend the checkpoint capabilities in user space, for instance to provide coordination of checkpoints of distributed MPI applications.
The "libcr" library in this package implements a wrapper around the system call interface exported by the kernel modules, and maintains the bookkeeping needed to allow registration of callbacks by runtime libraries. It also provides the necessary thread-safety and signal-safety mechanisms. Thus, this library provides the means for applications and run-time libraries, such as MPI, to register callback functions to be run when a checkpoint is taken or when restarting from one. This library may also be used as an LD_PRELOAD to enable checkpointing of applications with development releases of BLCR (which cannot preempt unmodified applications otherwise).
This package also includes simple command line utilities to request a checkpoint or restart of a process. These provide the means for a user, system administrator, or batch scheduler to use BLCR.
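The callback mechanism described above can be illustrated with a small, self-contained sketch. This is not the libcr API: every name here (CallbackRegistry, ToyRuntime, the "checkpoint" and "restart" event strings) is hypothetical and chosen only for illustration. It shows the general pattern of a runtime library registering a function that the checkpoint system invokes when a checkpoint is taken and again when execution resumes from one.

```python
from typing import Callable, List

# Hypothetical event names; the real libcr interface is a C API and differs.
CHECKPOINT = "checkpoint"
RESTART = "restart"

Callback = Callable[[str], None]


class CallbackRegistry:
    """Toy stand-in for the bookkeeping a checkpoint library maintains."""

    def __init__(self) -> None:
        self._callbacks: List[Callback] = []

    def register(self, callback: Callback) -> None:
        # Remember the callback so it can be invoked on later events.
        self._callbacks.append(callback)

    def notify(self, event: str) -> None:
        # Invoke every registered callback, in registration order.
        if event not in (CHECKPOINT, RESTART):
            raise ValueError(f"unknown event: {event}")
        for callback in self._callbacks:
            callback(event)


class ToyRuntime:
    """Hypothetical communication runtime that cooperates with checkpoints."""

    def __init__(self, registry: CallbackRegistry) -> None:
        self.log: List[str] = []
        registry.register(self.on_event)

    def on_event(self, event: str) -> None:
        if event == CHECKPOINT:
            # A real runtime would drain in-flight messages here.
            self.log.append("quiesced")
        else:
            # A real runtime would re-establish connections here.
            self.log.append("reconnected")


registry = CallbackRegistry()
runtime = ToyRuntime(registry)
registry.notify(CHECKPOINT)  # a checkpoint is being taken
registry.notify(RESTART)     # execution resumes from that checkpoint
print(runtime.log)
```

In this sketch the registry calls callbacks directly. The abstract indicates that the real library additionally handles thread safety and signal safety, which this example does not attempt to model.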
- Short Name / Acronym:
- BLCR
- Project Type:
- Open Source, No Publicly Available Repository
- Site Accession Number:
- 4069
- Software Type:
- Scientific
- License(s):
- Other (Commercial or Open-Source)
- Research Organization:
- Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
- Sponsoring Organization:
- USDOE; Primary Award/Contract Number: AC03-76SF00098
- DOE Contract Number:
- AC03-76SF00098
- Code ID:
- 54577
- OSTI ID:
- code-54577
- Country of Origin:
- United States