McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; Moody, Adam; de Supinski, Bronis R.; Eigenmann, Rudolf

doi:10.1155/2013/341672

Journal Article · Tue Jan 01 00:00:00 EST 2013 · Scientific Programming

DOI: https://doi.org/10.1155/2013/341672 · OSTI ID: 1197891

Islam, Tanzima Zerin ^[1]; Mohror, Kathryn ^[2]; Bagchi, Saurabh ^[1]; Moody, Adam ^[2]; de Supinski, Bronis R. ^[2]; Eigenmann, Rudolf ^[1]

[1] School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
[2] Lawrence Livermore National Laboratory, Livermore, CA, USA

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
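The contrast the abstract draws between simple concatenation and data-aware aggregation can be illustrated with a minimal Python sketch. This is not mcrEngine's implementation: it assumes each process's checkpoint is a plain dictionary mapping variable names to raw bytes (standing in for the variable metadata that HDF5 or netCDF would expose), uses `zlib` as a stand-in compressor, and the function names `concat_then_compress`, `pack_data_aware`, and `unpack_data_aware` are hypothetical.

```python
import json
import zlib


def concat_then_compress(checkpoints):
    """Baseline: append each process's checkpoint whole, then compress."""
    blob = b"".join(data for ckpt in checkpoints for data in ckpt.values())
    return zlib.compress(blob)


def pack_data_aware(checkpoints):
    """Place same-named variables from all processes side by side, then compress.

    Assumption: a checkpoint is a dict {variable name (str): bytes}.
    """
    names = sorted({name for ckpt in checkpoints for name in ckpt})
    index = []          # per variable: [name, per-process lengths], -1 = missing
    body = bytearray()
    for name in names:
        lengths = []
        for ckpt in checkpoints:
            if name in ckpt:
                lengths.append(len(ckpt[name]))
                body += ckpt[name]
            else:
                lengths.append(-1)
        index.append([name, lengths])
    header = json.dumps({"nprocs": len(checkpoints), "vars": index}).encode("utf-8")
    # Layout: 8-byte header length, JSON header, compressed variable data.
    return len(header).to_bytes(8, "big") + header + zlib.compress(bytes(body))


def unpack_data_aware(blob):
    """Restart path: rebuild the per-process checkpoints from the merged blob."""
    header_len = int.from_bytes(blob[:8], "big")
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    body = zlib.decompress(blob[8 + header_len:])
    checkpoints = [{} for _ in range(header["nprocs"])]
    offset = 0
    for name, lengths in header["vars"]:
        for rank, length in enumerate(lengths):
            if length < 0:
                continue  # this process did not write the variable
            checkpoints[rank][name] = body[offset:offset + length]
            offset += length
    return checkpoints
```

Comparing `len(concat_then_compress(ckpts))` with `len(pack_data_aware(ckpts))` on real checkpoint data gives a rough sense of the effect; how much grouping helps depends on how similar the same variable is across processes and on the compressor, so the sketch makes no claim about the size of the gain.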

Sponsoring Organization: USDOE

Grant/Contract Number: AC52-07NA27344

OSTI ID: 1197891

Journal Information: Scientific Programming, Vol. 21, Issue 3-4; ISSN 1058-9244

Publisher: Hindawi Publishing Corporation

Country of Publication: Egypt

Language: English

Similar Records

Orchestrating Fault Prediction with Live Migration and Checkpointing

Conference · Mon Jun 01 00:00:00 EDT 2020 · OSTI ID:1648858

Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System

Journal Article · Mon Sep 01 00:00:00 EDT 2014 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1225695

Affinity-aware checkpoint restart

Journal Article · Sun Dec 07 19:00:00 EST 2014 · ACM Digital Library · OSTI ID:1342535
