U.S. Department of Energy
Office of Scientific and Technical Information

Orchestrating Fault Prediction with Live Migration and Checkpointing

Conference · OSTI ID:1648858

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ~20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ~29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
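The orchestration idea in the abstract can be illustrated with a small decision rule. The following is only a sketch under stated assumptions, not the model from the paper: the function name `choose_action`, its parameters, and the rule itself (migrate if the predicted lead time covers a live migration, otherwise checkpoint to the burst buffer if there is time, otherwise fall back on the most recent periodic checkpoint) are hypothetical and chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    LIVE_MIGRATE = "live_migrate"
    CHECKPOINT_TO_BB = "checkpoint_to_bb"
    NO_PROACTIVE_ACTION = "no_proactive_action"


def choose_action(lead_time_s, migration_time_s, bb_checkpoint_time_s):
    """Pick a proactive action for one predicted failure.

    Hypothetical policy, not taken from the paper:
    - lead_time_s: seconds between the prediction and the expected failure
    - migration_time_s: assumed time to live-migrate off the failing node
    - bb_checkpoint_time_s: assumed time to write a checkpoint to the burst buffer
    """
    if lead_time_s >= migration_time_s:
        # Enough warning to move the work away and avoid the failure entirely.
        return Action.LIVE_MIGRATE
    if lead_time_s >= bb_checkpoint_time_s:
        # Not enough time to migrate, but a fresh checkpoint limits lost work.
        return Action.CHECKPOINT_TO_BB
    # Too little warning: recovery relies on the last periodic checkpoint.
    return Action.NO_PROACTIVE_ACTION
```

For example, under these assumptions a prediction with 60 s of lead time, a 30 s migration, and a 10 s burst-buffer checkpoint would select live migration, while the same prediction with only 15 s of lead time would select a proactive checkpoint instead.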

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1648858
Country of Publication:
United States
Language:
English

Similar Records

McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Journal Article · Mon Dec 31 23:00:00 EST 2012 · Scientific Programming · OSTI ID:1197891

Proactive Fault Tolerance for HPC with Xen Virtualization
Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:978756

Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library
Conference · Tue Mar 20 00:00:00 EDT 2012 · OSTI ID:1047769
