DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Abstract

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
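The contrast the abstract draws between simple concatenation and data-aware aggregation can be illustrated with a toy sketch. This is not mcrEngine's implementation: the variable names, the dictionary-based checkpoint layout, and the helper functions below are all hypothetical, and `zlib` stands in for whatever compressor the real system uses. In practice the variable boundaries would come from the metadata of an I/O library such as HDF5 or netCDF; here a plain dictionary plays that role.

```python
import struct
import zlib


def make_checkpoint(rank, n=2048):
    """Build a toy per-process checkpoint: variable name -> raw bytes (hypothetical layout)."""
    return {
        "temperature": struct.pack(
            f"<{n}d", *(300.0 + 0.001 * ((i + rank) % 16) for i in range(n))
        ),
        "cell_id": struct.pack(f"<{n}i", *(rank * n + i for i in range(n))),
        "flags": bytes((i + rank) % 4 for i in range(n)),
    }


def concat_then_compress(checkpoints):
    """Baseline: append whole checkpoints one after another, then compress."""
    blob = b"".join(
        b"".join(ckpt[name] for name in sorted(ckpt)) for ckpt in checkpoints
    )
    return zlib.compress(blob)


def data_aware_compress(checkpoints):
    """Place same-named variables from all processes next to each other, then compress."""
    names = sorted(checkpoints[0])
    blob = b"".join(ckpt[name] for name in names for ckpt in checkpoints)
    return zlib.compress(blob)


def restore(compressed, names_and_sizes, n_procs):
    """Invert data_aware_compress, given each variable's per-process size in bytes."""
    blob = zlib.decompress(compressed)
    checkpoints = [{} for _ in range(n_procs)]
    offset = 0
    for name, size in names_and_sizes:
        for rank in range(n_procs):
            checkpoints[rank][name] = blob[offset:offset + size]
            offset += size
    return checkpoints


checkpoints = [make_checkpoint(rank) for rank in range(8)]
baseline = concat_then_compress(checkpoints)
aware = data_aware_compress(checkpoints)
layout = [(name, len(checkpoints[0][name])) for name in sorted(checkpoints[0])]

# Compare the compressed sizes of the two orderings for this synthetic data.
print("concatenate + compress:", len(baseline), "bytes")
print("data-aware + compress: ", len(aware), "bytes")
```

Both functions compress exactly the same bytes; only the ordering differs. Grouping by variable puts values of the same type and similar range side by side, which is the property the abstract credits for the improved compressibility. How much this helps depends on the data and the compressor, so the sizes printed for this synthetic input say nothing about the figures reported in the paper.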

Authors:
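Islam, Tanzima Zerin [1]; Mohror, Kathryn [2]; Bagchi, Saurabh [1]; Moody, Adam [2]; de Supinski, Bronis R. [2]; Eigenmann, Rudolf [1]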
  1. School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
  2. Lawrence Livermore National Laboratory, Livermore, CA, USA
Publication Date:
2013
Sponsoring Org.:
USDOE
OSTI Identifier:
1197891
Grant/Contract Number:  
AC52-07NA27344
Resource Type:
Published Article
Journal Name:
Scientific Programming
Additional Journal Information:
Journal Name: Scientific Programming; Journal Volume: 21; Journal Issue: 3-4; Journal ID: ISSN 1058-9244
Publisher:
Hindawi Publishing Corporation
Country of Publication:
Egypt
Language:
English

Citation Formats

Islam, Tanzima Zerin, Mohror, Kathryn, Bagchi, Saurabh, Moody, Adam, de Supinski, Bronis R., and Eigenmann, Rudolf. McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression. Egypt: N. p., 2013. Web. doi:10.1155/2013/341672.
Islam, Tanzima Zerin, Mohror, Kathryn, Bagchi, Saurabh, Moody, Adam, de Supinski, Bronis R., & Eigenmann, Rudolf (2013). McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression. Egypt. https://doi.org/10.1155/2013/341672
Islam, Tanzima Zerin, Mohror, Kathryn, Bagchi, Saurabh, Moody, Adam, de Supinski, Bronis R., and Eigenmann, Rudolf. 2013. "McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression". Egypt. https://doi.org/10.1155/2013/341672.
@article{osti_1197891,
title = {McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression},
author = {Islam, Tanzima Zerin and Mohror, Kathryn and Bagchi, Saurabh and Moody, Adam and de Supinski, Bronis R. and Eigenmann, Rudolf},
abstractNote = {High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.},
doi = {10.1155/2013/341672},
journal = {Scientific Programming},
number = {3-4},
volume = {21},
place = {Egypt},
year = {2013},
month = {jan}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
https://doi.org/10.1155/2013/341672

Citation Metrics:
Cited by: 24 works
Citation information provided by
Web of Science
