U.S. Department of Energy
Office of Scientific and Technical Information

Berkeley lab checkpoint/restart (BLCR) for Linux clusters

Journal Article · Journal of Physics. Conference Series
Author affiliation: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
This article describes the motivation, design, and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time- and space-efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to fault precursors (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature readings from sensors). Preemption can also increase the efficiency of batch scheduling: for instance, it can reduce idle cycles (by allowing shutdown without a queue-draining period, or by reallocating resources to eliminate idle nodes when better-fitting jobs are queued) and reduce average queue time (by limiting large jobs to off-peak hours without limiting the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters. © 2006 IOP Publishing Ltd.
Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Grant/Contract Number:
AC02-05CH11231
OSTI ID:
1407049
Alternate ID(s):
OSTI ID: 926560
Journal Information:
Journal Name: Journal of Physics. Conference Series; Vol. 46; ISSN 1742-6588
Publisher:
IOP Publishing
Country of Publication:
United States
Language:
English

Cited By (12)

Checkpointing of Parallel MPI Applications Using MPI One-sided API with Support for Byte-addressable Non-volatile RAM journal January 2016
Resiliency in Numerical Algorithm Design for Extreme Scale Simulations preprint January 2020
A lightweight software fault-tolerance system in the cloud environment journal December 2013
EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications journal August 2018
Node failure resiliency for Uintah without checkpointing
  • Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 20 https://doi.org/10.1002/cpe.5340
journal June 2019
CDMCR: multi-level fault-tolerant system for distributed applications in cloud journal January 2015
Migrating LinuX Containers Using CRIU book January 2016
Cloud resource allocation schemes: review, taxonomy, and opportunities journal May 2016
Multi-Fault Tolerance for Cartesian Data Distributions journal November 2012
Job migration in HPC clusters by means of checkpoint/restart journal April 2019
A scalable and extensible checkpointing scheme for massively parallel simulations journal May 2018
A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations preprint January 2017

Similar Records

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters
Journal Article · Wed Jul 26 2006 · Journal of Physics: Conference Series · OSTI ID:926560

Berkeley Lab Checkpoint/Restart for Linux
Software · Fri Nov 14 2003 · OSTI ID:code-54577

The design and implementation of Berkeley Lab's Linux checkpoint/restart
Technical Report · Sat Apr 30 2005 · OSTI ID:891617
