Title: A checkpoint compression study for high-performance computing systems

Abstract

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key results are: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost-effective.

Authors:
 Ibtesham, Dewan [1]; Ferreira, Kurt B. [2]; Arnold, Dorian [1]
  1. Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science
  2. Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States). Scalable System Software Dept.
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1426906
Report Number(s):
SAND2014-15140J
Journal ID: ISSN 1094-3420; 534304
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Journal Article
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 29; Journal Issue: 4; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; fault tolerance; checkpoint/restart; checkpoint compression

Citation Formats

Ibtesham, Dewan, Ferreira, Kurt B., and Arnold, Dorian. A checkpoint compression study for high-performance computing systems. United States: N. p., 2015. Web. doi:10.1177/1094342015570921.
Ibtesham, Dewan, Ferreira, Kurt B., & Arnold, Dorian. A checkpoint compression study for high-performance computing systems. United States. doi:10.1177/1094342015570921.
Ibtesham, Dewan, Ferreira, Kurt B., and Arnold, Dorian. "A checkpoint compression study for high-performance computing systems". United States. doi:10.1177/1094342015570921. https://www.osti.gov/servlets/purl/1426906.
@article{osti_1426906,
title = {A checkpoint compression study for high-performance computing systems},
author = {Ibtesham, Dewan and Ferreira, Kurt B. and Arnold, Dorian},
abstractNote = {As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key results are: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost-effective.},
doi = {10.1177/1094342015570921},
journal = {International Journal of High Performance Computing Applications},
issn = {1094-3420},
number = 4,
volume = 29,
place = {United States},
year = {2015},
month = {2}
}
