OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Orchestrating Fault Prediction with Live Migration and Checkpointing

Abstract

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ~20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ~29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.

Authors:
Behera, Subhendu [1]; Wan, Lipeng [2]; Mueller, Frank [1]; Wolf, Matthew D. [2]; Klasky, Scott A. [2]
  1. North Carolina State University (NCSU), Raleigh
  2. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1648858
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: International Symposium on High-Performance Parallel and Distributed Computing (HPDC '20) - Stockholm, Sweden - 6/23/2020-6/26/2020
Country of Publication:
United States
Language:
English

Citation Formats

Behera, Subhendu, Wan, Lipeng, Mueller, Frank, Wolf, Matthew D., and Klasky, Scott A. Orchestrating Fault Prediction with Live Migration and Checkpointing. United States: N. p., 2020. Web.
Behera, Subhendu, Wan, Lipeng, Mueller, Frank, Wolf, Matthew D., & Klasky, Scott A. Orchestrating Fault Prediction with Live Migration and Checkpointing. United States.
Behera, Subhendu, Wan, Lipeng, Mueller, Frank, Wolf, Matthew D., and Klasky, Scott A. Mon . "Orchestrating Fault Prediction with Live Migration and Checkpointing". United States. https://www.osti.gov/servlets/purl/1648858.
@article{osti_1648858,
title = {Orchestrating Fault Prediction with Live Migration and Checkpointing},
author = {Behera, Subhendu and Wan, Lipeng and Mueller, Frank and Wolf, Matthew D. and Klasky, Scott A.},
abstractNote = {Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ~20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ~29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.},
doi = {},
url = {https://www.osti.gov/biblio/1648858},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2020},
month = {6}
}

