OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Proactive Process-Level Live Migration and Back Migration in HPC Environments

Abstract

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit of back migration.
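
The abstract does not say how the "nearly half" figure is derived. One plausible reading, sketched below, assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint. If a fraction p of faults is handled proactively, only the remaining (1 - p) must be covered by checkpoints, so the effective MTBF grows by 1 / (1 - p) and the checkpoint count for a fixed run length scales by sqrt(1 - p). The function names and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoint count with proactive FT divided by the count without it.

    Assumes a fixed run length and that proactively handled faults no longer
    need checkpoint coverage, so the effective MTBF is MTBF / (1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical inputs: 50 s per checkpoint, 10,000 s mean time between failures.
base = young_interval(50.0, 10_000.0)
with_proactive = young_interval(50.0, 10_000.0 / (1.0 - 0.7))
print(f"interval without proactive FT: {base:.0f} s")
print(f"interval with 70% proactive:   {with_proactive:.0f} s")
print(f"checkpoint count ratio:        {checkpoint_ratio(0.7):.3f}")
```

Under these assumptions the ratio for p = 0.7 is sqrt(0.3), roughly 0.55, which is consistent with the abstract's "nearly cutting the number of checkpoints in half".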

Authors:
Wang, Chao [1]; Mueller, Frank [2]; Engelmann, Christian [1]; Scott, Stephen L. [1]
  1. ORNL
  2. North Carolina State University
Publication Date:
January 2012
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1037151
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Journal Article
Resource Relation:
Journal Name: Journal of Parallel and Distributed Computing; Journal Volume: 72; Journal Issue: 2
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; SUPERCOMPUTERS; FAULT TOLERANT COMPUTERS; INPUT-OUTPUT ANALYSIS

Citation Formats

Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Stephen L. Proactive Process-Level Live Migration and Back Migration in HPC Environments. United States: N. p., 2012. Web. doi:10.1016/j.jpdc.2011.10.009.
Wang, Chao, Mueller, Frank, Engelmann, Christian, & Scott, Stephen L. Proactive Process-Level Live Migration and Back Migration in HPC Environments. United States. doi:10.1016/j.jpdc.2011.10.009.
Wang, Chao, Mueller, Frank, Engelmann, Christian, and Scott, Stephen L. 2012. "Proactive Process-Level Live Migration and Back Migration in HPC Environments". United States. doi:10.1016/j.jpdc.2011.10.009.
@article{osti_1037151,
title = {Proactive Process-Level Live Migration and Back Migration in HPC Environments},
author = {Wang, Chao and Mueller, Frank and Engelmann, Christian and Scott, Stephen L},
abstractNote = {As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the higher the benefit of back migration.},
doi = {10.1016/j.jpdc.2011.10.009},
journal = {Journal of Parallel and Distributed Computing},
number = 2,
volume = 72,
place = {United States},
year = 2012,
month = 1
}
  • As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
  • This article examines the concepts of quality management (QM) and quality assurance (QA), as well as the current state of QM and QA practices in radiotherapy. A systematic approach incorporating a series of industrial engineering-based tools is proposed, which can be applied in health care organizations proactively to improve process outcomes, reduce risk and/or improve patient safety, improve throughput, and reduce cost. This tool set includes process mapping and process flowcharting, failure modes and effects analysis (FMEA), value stream mapping, and fault tree analysis (FTA). Many health care organizations do not have experience in applying these tools and therefore do not understand how and when to use them. As a result, there are many misconceptions about how to use these tools, and they are often incorrectly applied. This article describes these industrial engineering-based tools and also how to use them, when they should be used (and not used), and the intended purposes for their use. In addition, the strengths and weaknesses of each of these tools are described, and examples are given to demonstrate the application of these tools in health care settings.
  • Smart meters are integral to demand response in emerging smart grids, by reporting the electricity consumption of users to serve application needs. But reporting real-time usage information for individual households raises privacy concerns. Existing techniques to guarantee differential privacy (DP) of smart meter users either are not fault tolerant or achieve (possibly partial) fault tolerance at high communication overheads. In this paper, we propose a fault-tolerant protocol for smart metering that can handle general communication failures while ensuring DP with significantly improved efficiency and lower errors compared with the state of the art. Our protocol handles fail-stop faults proactively by using a novel design of future ciphertexts, and distributes trust among the smart meters by sharing secret keys among them. We prove the DP properties of our protocol and analyze its advantages in fault tolerance, accuracy, and communication efficiency relative to competing techniques. We illustrate our analysis by simulations driven by real-world traces of electricity consumption.