
Title: A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

Abstract

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale clusters due to a mean-time-to-failure (MTTF) on the order of hours. After a failure, C/R mechanisms generally require a complete restart of an MPI job from the last checkpoint. A complete restart, however, is unnecessary since all but one node are typically still alive. Furthermore, a restart may result in lengthy job requeuing even though the original job had not exceeded its time quantum. In this paper, we overcome these shortcomings. Instead of job restart, we have developed a transparent mechanism for job pause within LAM/MPI+BLCR. This mechanism allows live nodes to remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares before resuming from the last checkpoint. Our methodology includes LAM/MPI enhancements in support of scalable group communication with a fluctuating number of nodes, reuse of network connections, transparent coordinated checkpoint scheduling, and a BLCR enhancement for job pause. Experiments in a cluster with the NAS Parallel Benchmark suite show that our overhead for job pause is comparable to that of a complete job restart. A minimal overhead of 5.6% is incurred only when migration takes place, while the regular checkpoint overhead remains unchanged. Yet, our approach alleviates the need to reboot the LAM run-time environment, which accounts for considerable overhead, resulting in net savings for our scheme in the experiments. Our solution further provides full transparency and automation with the additional benefit of reusing existing resources. Execution continues after failures within the scheduled job, i.e., the application staging overhead is not incurred again, in contrast to a restart. Our scheme offers additional potential for savings through incremental checkpointing and proactive diskless live migration, which we are currently working on.
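
The job pause idea described in the abstract can be illustrated with a toy model. The sketch below is not the LAM/MPI+BLCR implementation and uses no real LAM or BLCR calls; the `Job` class, the node names, and the step counter standing in for application state are all assumptions made for illustration. It only shows the control flow: ranks on live nodes stay where they are, the rank on the failed node is moved to a spare, and every rank resumes from the last coordinated checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Toy stand-in for a running MPI job (hypothetical, for illustration)."""
    placement: dict          # rank -> name of the node hosting that rank
    spares: list             # names of idle spare nodes
    state: dict              # rank -> application state (a step counter here)
    checkpoint: dict = field(default_factory=dict)

    def step(self):
        # Advance every rank by one unit of work.
        for rank in self.state:
            self.state[rank] += 1

    def take_checkpoint(self):
        # Coordinated checkpoint: snapshot all ranks at the same point.
        self.checkpoint = dict(self.state)

    def pause_and_recover(self, failed_node):
        # Move ranks from the failed node to a spare; live ranks keep their node.
        if not self.spares:
            raise RuntimeError("no spare node available")
        spare = self.spares.pop(0)
        moved = []
        for rank, node in self.placement.items():
            if node == failed_node:
                self.placement[rank] = spare
                moved.append(rank)
        # All ranks, live and migrated, roll back to the last checkpoint.
        self.state = dict(self.checkpoint)
        return moved


job = Job(placement={0: "n0", 1: "n1", 2: "n2"},
          spares=["s0"],
          state={0: 0, 1: 0, 2: 0})
job.step()
job.step()
job.take_checkpoint()   # checkpoint taken at step 2
job.step()              # work done after the checkpoint is lost on failure
moved = job.pause_and_recover("n1")
print(moved, job.placement, job.state)
```

In this model only rank 1 changes its host, while ranks 0 and 2 keep their original nodes, which mirrors the abstract's point that a full restart of all nodes is unnecessary when a single node fails.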

Authors:
 Wang, Chao [1]; Mueller, Frank [1]; Engelmann, Christian [2]; Scott, Steven L. [2]
  1. North Carolina State University (NCSU), Raleigh
  2. ORNL
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
931501
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 21st International Parallel and Distributed Processing Symposium (IPDPS) 2007, Long Beach, CA, USA, March 26-30, 2007
Country of Publication:
United States
Language:
English

Citation Formats

Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Steven L. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. United States: N. p., 2007. Web. doi:10.1109/IPDPS.2007.370307.
