A checkpoint compression study for high-performance computing systems
- Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science
- Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States). Scalable System Software Dept.
As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by characterizing its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key results are: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.
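The trade-off the abstract describes (time spent compressing versus time saved writing a smaller checkpoint) can be illustrated with a minimal sketch. This is not the authors' evaluation code or model: the function name `compression_pays_off`, the `write_bandwidth` parameter, and the use of Python's `zlib` as a stand-in for a generic compressor are all assumptions made for illustration.

```python
import time
import zlib


def compression_pays_off(checkpoint: bytes, write_bandwidth: float, level: int = 6) -> dict:
    """Estimate checkpoint commit time with and without compression.

    Hypothetical helper: write_bandwidth is an assumed storage bandwidth
    in bytes per second, and zlib stands in for any generic compressor.
    """
    if not checkpoint:
        raise ValueError("checkpoint must not be empty")
    if write_bandwidth <= 0:
        raise ValueError("write_bandwidth must be positive")

    # Measure how long compression itself takes on this machine.
    start = time.perf_counter()
    compressed = zlib.compress(checkpoint, level)
    compress_seconds = time.perf_counter() - start

    # Estimated time to write each variant at the assumed bandwidth.
    plain_seconds = len(checkpoint) / write_bandwidth
    compressed_seconds = compress_seconds + len(compressed) / write_bandwidth

    return {
        # Fraction of the checkpoint volume removed by compression.
        "compression_factor": 1 - len(compressed) / len(checkpoint),
        "plain_seconds": plain_seconds,
        "compressed_seconds": compressed_seconds,
        # Compression helps only if compressing plus writing the smaller
        # file is faster than writing the original file.
        "worthwhile": compressed_seconds < plain_seconds,
    }
```

Under these assumptions, compression is more likely to be worthwhile when the write bandwidth available to each process is low relative to the compressor's throughput; with very fast storage the time spent compressing can outweigh the savings.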
- Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 1426906
- Report Number(s): SAND2014--15140J; 534304
- Journal Information: International Journal of High Performance Computing Applications, Vol. 29, Issue 4; ISSN 1094-3420
- Publisher: SAGE
- Country of Publication: United States
- Language: English