Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

Hargrove, Paul H; Duell, Jason C

doi:10.1088/1742-6596/46/1/067

Journal Article · Wed Jul 26 00:00:00 EDT 2006 · Journal of Physics: Conference Series

DOI: https://doi.org/10.1088/1742-6596/46/1/067 · OSTI ID: 926560

This article describes the motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI. Application-level solutions, including both checkpointing and fault-tolerant algorithms, are recognized as more time and space efficient than system-level checkpoints, which cannot make use of any application-specific knowledge. However, system-level checkpointing allows for preemption, making it suitable for responding to "fault precursors" (for instance, elevated error rates from ECC memory or network CRCs, or elevated temperature from sensors). Preemption can also increase the efficiency of batch scheduling; for instance reducing idle cycles (by allowing for shutdown without any queue draining period or reallocation of resources to eliminate idle nodes when better fitting jobs are queued), and reducing the average queued time (by limiting large jobs to running during off-peak hours, without the need to limit the length of such jobs). Each of these potential uses makes BLCR a valuable tool for efficient resource management in Linux clusters.

Research Organization: Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States)

Sponsoring Organization: USDOE Director. Office of Science. Advanced Scientific Computing Research

DOE Contract Number: DE-AC02-05CH11231

OSTI ID: 926560

Report Number(s): LBNL-60520; R&D Project: KS3210; BnR: KJ0101030

Journal Information: Journal of Physics: Conference Series, Vol. 46; Related Information: Journal Publication Date: 2006

Country of Publication: United States

Language: English

References (1)

The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing · Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian · The International Journal of High Performance Computing Applications, Vol. 19, Issue 4 · https://doi.org/10.1177/1094342005056139 · journal · November 2005

Cited By (12)

A lightweight software fault-tolerance system in the cloud environment · Chen, Gang; Jin, Hai; Zou, Deqing · Concurrency and Computation: Practice and Experience, Vol. 27, Issue 12 · https://doi.org/10.1002/cpe.3190 · journal · December 2013
EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications · Chakraborty, Sourav; Laguna, Ignacio; Emani, Murali · Concurrency and Computation: Practice and Experience · https://doi.org/10.1002/cpe.4863 · journal · August 2018
Node failure resiliency for Uintah without checkpointing · Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John · Concurrency and Computation: Practice and Experience, Vol. 31, Issue 20 · https://doi.org/10.1002/cpe.5340 · journal · June 2019
CDMCR: multi-level fault-tolerant system for distributed applications in cloud · Qiang, Weizhong; Jiang, Changqing; Ran, Longbo · Security and Communication Networks, Vol. 9, Issue 15 · https://doi.org/10.1002/sec.1187 · journal · January 2015
Migrating LinuX Containers Using CRIU · Pickartz, Simon; Eiling, Niklas; Lankes, Stefan · Lecture Notes in Computer Science · https://doi.org/10.1007/978-3-319-46079-6_47 · book · January 2016
Cloud resource allocation schemes: review, taxonomy, and opportunities · Yousafzai, Abdullah; Gani, Abdullah; Noor, Rafidah Md · Knowledge and Information Systems, Vol. 50, Issue 2 · https://doi.org/10.1007/s10115-016-0951-y · journal · May 2016
Multi-Fault Tolerance for Cartesian Data Distributions · Ali, Nawab; Krishnamoorthy, Sriram; Halappanavar, Mahantesh · International Journal of Parallel Programming, Vol. 41, Issue 3 · https://doi.org/10.1007/s10766-012-0218-5 · journal · November 2012
Job migration in HPC clusters by means of checkpoint/restart · Rodríguez-Pascual, Manuel; Cao, Jiajun; Moríñigo, José A. · The Journal of Supercomputing, Vol. 75, Issue 10 · https://doi.org/10.1007/s11227-019-02857-y · journal · April 2019
A scalable and extensible checkpointing scheme for massively parallel simulations · Kohl, Nils; Hötzer, Johannes; Schornbaum, Florian · The International Journal of High Performance Computing Applications, Vol. 33, Issue 4 · https://doi.org/10.1177/1094342018767736 · journal · May 2018
A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations · Kohl, Nils; Hötzer, Johannes; Schornbaum, Florian · arXiv · https://doi.org/10.48550/arxiv.1708.08286 · preprint · January 2017
Checkpointing of Parallel MPI Applications Using MPI One-sided API with Support for Byte-addressable Non-volatile RAM · Dorożyński, Piotr; Czarnul, Paweł; Malinowski, Artur · Procedia Computer Science, Vol. 80 · https://doi.org/10.1016/j.procs.2016.05.295 · journal · January 2016
Resiliency in Numerical Algorithm Design for Extreme Scale Simulations · Agullo, Emmanuel; Altenbernd, Mirco; Anzt, Hartwig · arXiv · https://doi.org/10.48550/arxiv.2010.13342 · preprint · January 2020

Similar Records

Berkeley lab checkpoint/restart (BLCR) for Linux clusters

Journal Article · Fri Sep 01 00:00:00 EDT 2006 · Journal of Physics. Conference Series · OSTI ID:926560

Hargrove, Paul H.; Duell, Jason C.

Berkeley Lab Checkpoint/Restart for Linux

Software · Sat Nov 15 00:00:00 EST 2003 · OSTI ID:926560

Duell, Jason C.; Roman, Eric; Hargrove, Paul H.

The design and implementation of Berkeley Lab's linux checkpoint/restart

Technical Report · Sat Apr 30 00:00:00 EDT 2005 · OSTI ID:926560

Duell, Jason

Related Subjects

99 GENERAL AND MISCELLANEOUS
