OSTI.GOV · U.S. Department of Energy
Office of Scientific and Technical Information

Title: McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Journal Article · Scientific Programming
DOI: https://doi.org/10.1155/2013/341672 · OSTI ID: 1197891
 [1];  [2];  [1];  [2];  [2];  [1]
  1. School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
  2. Lawrence Livermore National Laboratory, Livermore, CA, USA

High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem with a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes using knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and then compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
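
The contrast the abstract draws between simple concatenation and data-aware aggregation can be illustrated with a minimal Python sketch. It is not mcrEngine's implementation: the checkpoint layout (a dictionary of variable name to raw bytes), the variable names, the equal per-process variable sizes, and all function names are assumptions made for illustration, and zlib stands in for whichever compressor the system actually uses.

```python
import zlib
from collections import defaultdict


def make_checkpoint(rank, n=1024):
    """Hypothetical per-process checkpoint: variable name -> raw bytes."""
    return {
        # 32-bit little-endian integers that vary slightly with the rank.
        "temperature": b"".join(
            (300 + (i + rank) % 7).to_bytes(4, "little") for i in range(n)
        ),
        # One byte per element, alternating 0 and 1.
        "flags": bytes(i % 2 for i in range(n)),
    }


def concat_then_compress(checkpoints):
    """Baseline: append whole checkpoints back to back, then compress once."""
    blob = b"".join(data for ckpt in checkpoints for data in ckpt.values())
    return zlib.compress(blob)


def merge_by_variable_then_compress(checkpoints):
    """Data-aware variant: group same-named variables across processes,
    then compress each group separately."""
    merged = defaultdict(list)
    for ckpt in checkpoints:
        for name, data in ckpt.items():
            merged[name].append(data)
    return {name: zlib.compress(b"".join(parts)) for name, parts in merged.items()}


def restore(compressed, num_procs):
    """Split each decompressed group back into per-process variables.
    Assumes every process wrote the same number of bytes per variable."""
    restored = [dict() for _ in range(num_procs)]
    for name, blob in compressed.items():
        raw = zlib.decompress(blob)
        size = len(raw) // num_procs
        for rank in range(num_procs):
            restored[rank][name] = raw[rank * size:(rank + 1) * size]
    return restored


if __name__ == "__main__":
    ckpts = [make_checkpoint(rank) for rank in range(4)]
    baseline = concat_then_compress(ckpts)
    merged = merge_by_variable_then_compress(ckpts)
    print("concatenate + compress:", len(baseline), "bytes")
    print("merge by variable + compress:", sum(len(b) for b in merged.values()), "bytes")
```

The sizes printed depend entirely on the synthetic data, so the sketch makes no claim about which approach wins here; the figures in the abstract come from the authors' evaluation on real application checkpoints. A real system would read variable names and types from the HDF5 or netCDF metadata rather than from a Python dictionary.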

Sponsoring Organization:
USDOE
Grant/Contract Number:
AC52-07NA27344
OSTI ID:
1197891
Journal Information:
Scientific Programming, Vol. 21, Issue 3-4; ISSN 1058-9244
Publisher:
Hindawi Publishing Corporation
Country of Publication:
Egypt
Language:
English
Citation Metrics:
Cited by: 24 works
Citation information provided by
Web of Science

Similar Records

...And Eat it Too: High Read Performance in Write-Optimized HPC I/O Middleware File Formats
Conference · January 1, 2009 · OSTI ID:1197891

Orchestrating Fault Prediction with Live Migration and Checkpointing
Conference · June 1, 2020 · OSTI ID:1197891

Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
Journal Article · September 1, 2014 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1197891
