U.S. Department of Energy
Office of Scientific and Technical Information

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

Conference

Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and increases storage space utilization, especially for workflows that need to capture the evolution of data structures at high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to the next can dramatically reduce checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and to assemble the differences in a compact fashion that does not result in excessive metadata. State-of-the-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: they are slow at detecting the differences, generate a large amount of metadata to keep track of the differences, and ignore crucial spatiotemporal checkpoint data redundancy. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method that exploits GPUs' high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques.
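The abstract does not spell out how a Merkle tree locates the changed data, so the following is only a minimal CPU-side sketch of the general idea, not the paper's GPU implementation. It assumes fixed-size chunks, SHA-256 hashes, and two checkpoints of equal length; the names `CHUNK`, `build_tree`, `changed_chunks`, `make_diff`, and `apply_diff` are hypothetical and chosen for illustration.

```python
import hashlib

CHUNK = 4  # hypothetical chunk size in bytes (tiny, for illustration only)


def build_tree(data: bytes, chunk: int = CHUNK):
    """Return a list of levels: levels[0] holds leaf hashes, levels[-1] the root."""
    leaves = [hashlib.sha256(data[i:i + chunk]).digest()
              for i in range(0, len(data), chunk)]
    if not leaves:  # empty buffer: a single leaf over empty bytes
        leaves = [hashlib.sha256(b"").digest()]
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Each parent hashes its two children; an odd node is hashed alone.
        parents = [
            hashlib.sha256(
                prev[i] + (prev[i + 1] if i + 1 < len(prev) else b"")
            ).digest()
            for i in range(0, len(prev), 2)
        ]
        levels.append(parents)
    return levels


def changed_chunks(old, new):
    """Walk down from the root, descending only into subtrees whose hashes differ.

    Assumes both trees have the same shape (same data length and chunk size).
    """
    assert [len(level) for level in old] == [len(level) for level in new]
    frontier = [0]  # node indices at the current depth, starting at the root
    for depth in range(len(new) - 1, 0, -1):
        below = []
        for idx in frontier:
            if old[depth][idx] != new[depth][idx]:
                for child in (2 * idx, 2 * idx + 1):
                    if child < len(new[depth - 1]):
                        below.append(child)
        frontier = below
    return [i for i in frontier if old[0][i] != new[0][i]]


def make_diff(old_data: bytes, new_data: bytes, chunk: int = CHUNK):
    """Incremental checkpoint: store only the chunks whose hashes changed."""
    changed = changed_chunks(build_tree(old_data, chunk), build_tree(new_data, chunk))
    return {i: new_data[i * chunk:(i + 1) * chunk] for i in changed}


def apply_diff(base: bytes, diff, chunk: int = CHUNK) -> bytes:
    """Restore a checkpoint by patching the changed chunks onto the base."""
    out = bytearray(base)
    for i, payload in diff.items():
        out[i * chunk:i * chunk + len(payload)] = payload
    return bytes(out)
```

With a sparse update, matching subtree hashes let the comparison skip whole regions of unchanged data, which is the property that makes tree-based change detection attractive for the sparse-update workloads the abstract describes. In a GPU setting the leaf hashing would presumably run in parallel across chunks, but that part is not modeled here.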

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR); National Science Foundation (NSF)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
2229849
Resource Relation:
Conference: 52nd International Conference on Parallel Processing, 08/07/23 - 08/10/23, Salt Lake City, UT, US
Country of Publication:
United States
Language:
English
