U.S. Department of Energy
Office of Scientific and Technical Information

Berkeley Lab Checkpoint/Restart for Linux

Software · DOI:https://doi.org/10.11578/dc.20210416.10 · OSTI ID:code-54577 · Code ID:54577

This package implements system-level checkpointing of scientific applications running on Linux clusters in a manner suitable for implementing preemption, migration, and fault recovery by a batch scheduler. The design includes documented interfaces for a cooperating application or library to implement extensions to the checkpoint system, such as consistent checkpointing of distributed MPI applications. Using this package with an appropriate MPI implementation, the vast majority of scientific applications that use MPI for communication are checkpointable without any modifications to the application source code.

Extending the VMAdump code used in the bproc system, the BLCR kernel modules provide three additional features necessary for useful system-level checkpointing of scientific applications (installation of bproc is not required to use BLCR). First, this package provides the bookkeeping and coordination required for checkpointing and restoring multi-threaded and multi-process applications running on a single node. Second, this package provides a system call interface allowing checkpoints to be requested by any authorized process, such as a batch scheduler. Third, this package provides a system call interface allowing applications and/or application libraries to extend the checkpoint capabilities in user space, for instance to provide coordination of checkpoints of distributed MPI applications.

The "libcr" library in this package implements a wrapper around the system call interface exported by the kernel modules, and maintains bookkeeping to allow registration of callbacks by run-time libraries. This library also provides the necessary thread-safety and signal-safety mechanisms. Thus, this library provides the means for applications and run-time libraries, such as MPI, to register callback functions to be run when a checkpoint is taken or when restarting from one. This library may also be used as an LD_PRELOAD to enable checkpointing of applications with development releases of BLCR (which cannot otherwise preempt unmodified applications).

This package also includes simple command line utilities to request a checkpoint or restart of a process. These provide the means for a user, system administrator, or batch scheduler to use BLCR.
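
The callback registration described above can be illustrated with a small sketch. This is a toy model in Python, not the libcr interface: the names `CallbackRegistry`, `take_checkpoint`, and `restart_from`, the phase strings, and the use of a dictionary copy in place of a saved process image are all assumptions made for illustration. It shows only the general pattern in which registered callbacks run when a checkpoint is taken and again when execution continues or restarts.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical model only: none of these names come from libcr.
Phase = str  # one of "checkpoint", "continue", "restart"
Callback = Callable[[Phase], None]


class CallbackRegistry:
    """Toy stand-in for the bookkeeping a checkpoint library keeps."""

    def __init__(self) -> None:
        self._callbacks: Dict[int, Callback] = {}
        self._next_id = 0

    def register(self, func: Callback) -> int:
        """Store a callback and return an id that can be used to remove it."""
        callback_id = self._next_id
        self._next_id += 1
        self._callbacks[callback_id] = func
        return callback_id

    def unregister(self, callback_id: int) -> None:
        """Forget a previously registered callback."""
        del self._callbacks[callback_id]

    def notify(self, phase: Phase) -> None:
        """Invoke every registered callback in registration order."""
        for callback_id in sorted(self._callbacks):
            self._callbacks[callback_id](phase)


def take_checkpoint(registry: CallbackRegistry, state: dict) -> dict:
    """Run callbacks, capture the state, then let the original run continue."""
    registry.notify("checkpoint")   # e.g. a communication library quiesces here
    image = copy.deepcopy(state)    # stand-in for writing the process image
    registry.notify("continue")     # the original process resumes
    return image


def restart_from(registry: CallbackRegistry, image: dict) -> dict:
    """Rebuild the state from an image, then run the restart callbacks."""
    state = copy.deepcopy(image)
    registry.notify("restart")      # e.g. a library re-opens its connections
    return state


events: List[str] = []
registry = CallbackRegistry()
registry.register(lambda phase: events.append(f"comm-library:{phase}"))
registry.register(lambda phase: events.append(f"application:{phase}"))

image = take_checkpoint(registry, {"iteration": 41})
restored = restart_from(registry, image)
print(events)
print(restored)
```

In this sketch the communication-library callback is registered first and therefore runs first in every phase; the ordering policy is a simplification chosen for the example and should not be read as a statement about how BLCR orders callbacks.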

Short Name / Acronym:
BLCR
Project Type:
Open Source, No Publicly Available Repository
Site Accession Number:
4069
Software Type:
Scientific
License(s):
Other (Commercial or Open-Source)
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE

Primary Award/Contract Number:
AC03-76SF00098
DOE Contract Number:
AC03-76SF00098
Code ID:
54577
OSTI ID:
code-54577
Country of Origin:
United States

Similar Records

The design and implementation of Berkeley Lab's Linux checkpoint/restart
Technical Report · Sat Apr 30 00:00:00 EDT 2005 · OSTI ID:891617

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
Journal Article · Fri Sep 01 00:00:00 EDT 2006 · Journal of Physics. Conference Series · OSTI ID:1407049

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
Journal Article · Wed Jul 26 00:00:00 EDT 2006 · Journal of Physics: Conference Series · OSTI ID:926560

Related Subjects