Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

Tan, Nigel; Luettgau, Jacob; Marquez, Jack; Teranishi, Keita; Morales, Nicolas; Bhowmick, Sanjukta; Taufer, Michela; Cappello, Franck; Nicolae, Bogdan

doi:10.1145/3605573.3605639

Conference · August 1, 2023

DOI: https://doi.org/10.1145/3605573.3605639 · OSTI ID: 2000364

Tan, Nigel [1]; Luettgau, Jacob [1]; Marquez, Jack [1]; Teranishi, Keita [2]; Morales, Nicolas [3]; Bhowmick, Sanjukta [4]; Taufer, Michela [1]; Cappello, Franck [5]; Nicolae, Bogdan [6]

[1] University of Tennessee, Knoxville (UTK)
[2] Oak Ridge National Laboratory (ORNL)
[3] Sandia National Laboratories (SNL)
[4] University of North Texas
[5] Argonne National Laboratory (ANL)
[6] Argonne National Laboratory (ANL)

Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and increases storage space utilization, especially for workflows that need to capture the evolution of data structures at high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to another can dramatically reduce checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and to assemble the differences in a compact fashion that does not result in excessive metadata. State-of-the-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: they are slow at detecting the differences, generate a large amount of metadata to keep track of the differences, and ignore crucial spatiotemporal checkpoint data redundancy. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method that exploits GPUs' high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques.
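To make the idea of hash-based incremental checkpointing concrete, the following is a minimal CPU-only sketch in Python. It is not the method proposed in the paper, which runs on GPUs and is not described in detail in this record. The function names (`chunk_hashes`, `merkle_root`, `incremental_checkpoint`, `restore`), the 4-byte chunk size, and the use of SHA-256 are all assumptions chosen for illustration. The sketch hashes fixed-size chunks, uses a Merkle root as a quick test for "anything changed?", and saves only the chunks whose hashes differ from the previous checkpoint. It assumes the buffer keeps the same length between checkpoints.

```python
import hashlib

CHUNK_SIZE = 4  # bytes per chunk; deliberately tiny for illustration


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Hash every fixed-size chunk of the buffer (the Merkle tree leaves)."""
    return [hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]


def merkle_root(leaves: list) -> bytes:
    """Combine leaf hashes pairwise until a single root hash remains."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def incremental_checkpoint(prev_hashes: list, data: bytes,
                           chunk_size: int = CHUNK_SIZE):
    """Return ({chunk_index: chunk_bytes} for changed chunks, new leaf hashes)."""
    cur_hashes = chunk_hashes(data, chunk_size)
    # Fast path: identical roots mean no chunk changed.
    if merkle_root(prev_hashes) == merkle_root(cur_hashes):
        return {}, cur_hashes
    diff = {i: data[i * chunk_size:(i + 1) * chunk_size]
            for i, h in enumerate(cur_hashes)
            if i >= len(prev_hashes) or prev_hashes[i] != h}
    return diff, cur_hashes


def restore(base: bytes, diff: dict, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Rebuild a checkpoint by applying saved chunks on top of the base buffer."""
    buf = bytearray(base)
    for i, chunk in diff.items():
        buf[i * chunk_size:i * chunk_size + len(chunk)] = chunk
    return bytes(buf)


if __name__ == "__main__":
    base = bytes(16)                   # full first checkpoint: 16 zero bytes
    hashes0 = chunk_hashes(base)
    state = bytearray(base)
    state[5] = 7                       # sparse update touching one chunk
    diff, hashes1 = incremental_checkpoint(hashes0, bytes(state))
    print(sorted(diff))                # indices of the chunks that were saved
    print(restore(base, diff) == bytes(state))
```

In this sketch the per-chunk metadata grows linearly with the number of changed chunks. Keeping that metadata compact and detecting changes quickly are the challenges the abstract identifies for existing techniques.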

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE; USDOE Office of Science (SC)

DOE Contract Number: AC05-00OR22725

OSTI ID: 2000364

Country of Publication: United States

Language: English
