skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Combining Partial Redundancy and Checkpointing for HPC

Abstract

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50\% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is an additional overhead on communication and synchronization. Thismore » work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resources utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy - but not for partial redundancy - and show a close fit to the model. At 80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds 770, 000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade-off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availabilities.« less

Authors:
 [1];  [2];  [2];  [1];  [3];  [2]
  1. North Carolina State University
  2. ORNL
  3. Sandia National Laboratories (SNL)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1081906
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, Macau, China, 20120618, 20120621
Country of Publication:
United States
Language:
English

Citation Formats

Elliott, James, Kharbas, Kishor H, Fiala, David J, Mueller, Frank, Ferreira, Kurt Brian, and Engelmann, Christian. Combining Partial Redundancy and Checkpointing for HPC. United States: N. p., 2012. Web.
Elliott, James, Kharbas, Kishor H, Fiala, David J, Mueller, Frank, Ferreira, Kurt Brian, & Engelmann, Christian. Combining Partial Redundancy and Checkpointing for HPC. United States.
Elliott, James, Kharbas, Kishor H, Fiala, David J, Mueller, Frank, Ferreira, Kurt Brian, and Engelmann, Christian. Sun . "Combining Partial Redundancy and Checkpointing for HPC". United States.
@article{osti_1081906,
title = {Combining Partial Redundancy and Checkpointing for HPC},
author = {Elliott, James and Kharbas, Kishor H and Fiala, David J and Mueller, Frank and Ferreira, Kurt Brian and Engelmann, Christian},
abstractNote = {Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault tolerance techniques in HPC focus on reactive ways to mitigate faults, namely via checkpoint and restart (C/R). Apart from storage overheads, C/R-based fault recovery comes at an additional cost in terms of application performance because normal execution is disrupted when checkpoints are taken. Studies have shown that applications running at a large scale spend more than 50\% of their total time saving checkpoints, restarting and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution. Thus, redundant copies can decrease the overall failure rate. The downside of redundancy is that extra resources are required and there is an additional overhead on communication and synchronization. This work contributes a model and analyzes the benefit of C/R in coordination with redundancy at different degrees to minimize the total wallclock time and resources utilization of HPC applications. We further conduct experiments with an implementation of redundancy within the MPI layer on a cluster. Our experimental results confirm the benefit of dual and triple redundancy - but not for partial redundancy - and show a close fit to the model. At 80,000 processes, dual redundancy requires twice the number of processing resources for an application but allows two jobs of 128 hours wallclock time to finish within the time of just one job without redundancy. For narrow ranges of processor counts, partial redundancy results in the lowest time. Once the count exceeds 770, 000, triple redundancy has the lowest overall cost. Thus, redundancy allows one to trade-off additional resource requirements against wallclock time, which provides a tuning knob for users to adapt to resource availabilities.},
doi = {},
journal = {},
number = ,
volume = ,
place = {United States},
year = {2012},
month = {1}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Save / Share: