A checkpoint compression study for high-performance computing systems

Ibtesham, Dewan; Ferreira, Kurt B.; Arnold, Dorian

doi:10.1177/1094342015570921

Journal Article · February 17, 2015 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342015570921 · OSTI ID: 1426906

Ibtesham, Dewan [1]; Ferreira, Kurt B. [2]; Arnold, Dorian [1]

[1] Univ. of New Mexico, Albuquerque, NM (United States). Dept. of Computer Science
[2] Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States). Scalable System Software Dept.

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key results are: (1) compression is a viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.
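The trade-off behind checkpoint compression can be illustrated with a minimal sketch. It is not the authors' model or code: it assumes a simple serial cost model in which a checkpoint is first compressed and then written, so compression helps only when the time spent compressing is smaller than the write time saved. The function names, the synthetic data, and the 50 MB/s bandwidth are hypothetical placeholders, and Python's standard `zlib` stands in for a generic compressor.

```python
import time
import zlib


def compression_pays_off(compress_rate, write_bandwidth, reduction):
    """Return True if compress-then-write beats writing the raw checkpoint.

    Assumed serial model (an illustration, not taken from the article):
      raw write time        = size / write_bandwidth
      compressed write time = size / compress_rate
                              + size * (1 - reduction) / write_bandwidth
    which simplifies to: compress_rate > write_bandwidth / reduction.
    Rates are in bytes per second; reduction = 1 - compressed/original.
    """
    if reduction <= 0:
        return False  # data did not shrink, so compression only adds cost
    return compress_rate > write_bandwidth / reduction


def measure(data, level=6):
    """Measure zlib's compression rate (bytes/s) and size reduction on data."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    rate = len(data) / elapsed if elapsed > 0 else float("inf")
    reduction = 1.0 - len(compressed) / len(data)
    return rate, reduction


if __name__ == "__main__":
    # Synthetic, highly compressible stand-in for checkpoint data.
    checkpoint = bytes(1 << 20)
    rate, reduction = measure(checkpoint)
    # Hypothetical per-process share of storage bandwidth: 50 MB/s.
    print(compression_pays_off(rate, 50e6, reduction))
```

Under this model, a slow storage path or a highly compressible checkpoint makes compression attractive, while a fast storage path raises the compression rate needed to break even.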

Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: AC04-94AL85000

OSTI ID: 1426906

Report Number(s): SAND2014--15140J; 534304

Journal Information: International Journal of High Performance Computing Applications, Vol. 29, Issue 4; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Similar Records

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

Journal Article · Mon Dec 31 19:00:00 EST 2012 · Scientific Programming · OSTI ID:1197891

Checkpointing Strategies for Shared High-Performance Computing Platforms

Journal Article · Mon Dec 31 19:00:00 EST 2018 · International Journal of Networking and Computing · OSTI ID:1492861

Toward an optimal online checkpoint solution under a two-level HPC checkpoint model

Journal Article · Mon Mar 28 20:00:00 EDT 2016 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1346727

Related Subjects

97 MATHEMATICS AND COMPUTING
checkpoint compression
checkpoint/restart
fault tolerance
