Title: Hybrid Checkpointing for MPI Jobs in HPC Environments

Conference · OSTI ID:1081792

As the core count in high-performance computing systems keeps increasing, faults are becoming commonplace. Checkpointing addresses such faults but captures full process images even though only a subset of the process image changes between checkpoints. We have designed a hybrid checkpointing technique for MPI tasks of high-performance applications. This technique alternates between full and incremental checkpoints: at incremental checkpoints, only the data changed since the last checkpoint is captured. Our implementation integrates new BLCR and LAM/MPI features that complement traditional full checkpoints. This results in significantly reduced checkpoint sizes and overheads with only moderate increases in restart overhead. After accounting for costs and savings, the benefits of incremental checkpoints are an order of magnitude larger than the added overheads on restarts. We further derive qualitative results indicating an optimal balance between full and incremental checkpoints for our novel approach at a ratio of 1:9, which outperforms both always-full and always-incremental checkpointing.
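
The alternation between full and incremental checkpoints described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the BLCR or LAM/MPI implementation: the class `HybridCheckpointer`, the constant `PAGE_SIZE`, and the parameter `incr_per_full` are hypothetical names, the "process image" is a plain byte string of fixed length, and changed pages are found by comparing against the previous contents rather than by any kernel-level dirty-page tracking. The default of nine incremental checkpoints per full checkpoint mirrors the 1:9 ratio reported above.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4  # toy page size in bytes (assumption; real pages are much larger)


class HybridCheckpointer:
    """Toy model: one full checkpoint followed by `incr_per_full` incremental ones."""

    def __init__(self, incr_per_full: int = 9) -> None:
        self.incr_per_full = incr_per_full
        # The most recent full checkpoint plus the incrementals taken after it.
        self.chain: List[Tuple[str, Dict[int, bytes]]] = []
        self._last: Dict[int, bytes] = {}  # page contents at the last checkpoint
        self._since_full = 0               # incrementals taken since the last full

    @staticmethod
    def _pages(memory: bytes) -> Dict[int, bytes]:
        # Split the image into numbered fixed-size pages.
        return {i // PAGE_SIZE: memory[i:i + PAGE_SIZE]
                for i in range(0, len(memory), PAGE_SIZE)}

    def checkpoint(self, memory: bytes) -> Tuple[str, int]:
        """Take a checkpoint; return its kind and the number of pages saved."""
        pages = self._pages(memory)
        if not self.chain or self._since_full == self.incr_per_full:
            # Full checkpoint: save every page and start a new chain.
            kind, saved = "full", pages
            self.chain = []
            self._since_full = 0
        else:
            # Incremental checkpoint: save only pages that differ from the
            # contents recorded at the previous checkpoint.
            kind = "incr"
            saved = {n: p for n, p in pages.items() if self._last.get(n) != p}
            self._since_full += 1
        self.chain.append((kind, saved))
        self._last = pages
        return kind, len(saved)

    def restore(self) -> bytes:
        """Rebuild the image: apply the full checkpoint, then each incremental."""
        image: Dict[int, bytes] = {}
        for _, saved in self.chain:
            image.update(saved)
        return b"".join(image[n] for n in sorted(image))


if __name__ == "__main__":
    ckpt = HybridCheckpointer(incr_per_full=9)
    mem = bytearray(16)
    for step in range(11):
        mem[step] = step + 1  # touch one byte between checkpoints
        print(step, ckpt.checkpoint(bytes(mem)))
    assert ckpt.restore() == bytes(mem)
```

The sketch also shows where the trade-off in the abstract comes from: incremental checkpoints save fewer pages, while `restore` has to replay every checkpoint in the chain, so a longer chain means more restart work.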

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
DE-AC05-00OR22725
OSTI ID:
1081792
Resource Relation:
Conference: 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, Shanghai, China, December 8-10, 2010
Country of Publication:
United States
Language:
English

Similar Records

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
Conference · 2007 · OSTI ID:1081792

MPI Stages: Checkpointing MPI State for Bulk Synchronous Applications
Journal Article · 2018 · EuroMPI'18 Proceedings of the 25th European MPI Users' Group Meeting, Barcelona, Spain, September 23-26, 2018 · OSTI ID:1081792

Combining Partial Redundancy and Checkpointing for HPC
Conference · 2012 · OSTI ID:1081792
