OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

Journal Article · Journal of Physics: Conference Series

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance reducing idle cycles (by allowing for shutdown without any queue draining period or reallocation of resources to eliminate idle nodes when better fitting jobs are queued), and reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

Research Organization:
Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)
Sponsoring Organization:
USDOE Director. Office of Science. Advanced Scientific Computing Research
DOE Contract Number:
DE-AC02-05CH11231
OSTI ID:
926560
Report Number(s):
LBNL-60520; R&D Project: KS3210; BnR: KJ0101030
Journal Information:
Journal of Physics: Conference Series, Vol. 46; Related Information: Journal Publication Date: 2006
Country of Publication:
United States
Language:
English

References (1)

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing journal November 2005

Cited By (12)

A lightweight software fault-tolerance system in the cloud environment journal December 2013
EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications journal August 2018
Node failure resiliency for Uintah without checkpointing
  • Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 20 https://doi.org/10.1002/cpe.5340
journal June 2019
CDMCR: multi-level fault-tolerant system for distributed applications in cloud journal January 2015
Migrating LinuX Containers Using CRIU book January 2016
Cloud resource allocation schemes: review, taxonomy, and opportunities journal May 2016
Multi-Fault Tolerance for Cartesian Data Distributions journal November 2012
Job migration in HPC clusters by means of checkpoint/restart journal April 2019
A scalable and extensible checkpointing scheme for massively parallel simulations journal May 2018
A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations preprint January 2017
Checkpointing of Parallel MPI Applications Using MPI One-sided API with Support for Byte-addressable Non-volatile RAM journal January 2016
Resiliency in Numerical Algorithm Design for Extreme Scale Simulations preprint January 2020

Similar Records

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
Journal Article · Sep 1, 2006 · Journal of Physics: Conference Series · OSTI ID:926560

Berkeley Lab Checkpoint/Restart for Linux
Software · Nov 15, 2003 · OSTI ID:926560

The design and implementation of Berkeley Lab's Linux checkpoint/restart
Technical Report · Apr 30, 2005 · OSTI ID:926560

Related Subjects