Proactive Process-Level Live Migration and Back Migration in HPC Environments

Wang, Chao; Mueller, Frank; Engelmann, Christian; Scott, Stephen L

doi:10.1016/j.jpdc.2011.10.009

Journal Article · Sat Dec 31 23:00:00 EST 2011 · Journal of Parallel and Distributed Computing

DOI: https://doi.org/10.1016/j.jpdc.2011.10.009 · OSTI ID: 1037151

Wang, Chao ^[1]; Mueller, Frank ^[2]; Engelmann, Christian ^[1]; Scott, Stephen L ^[1]

[1] ORNL
[2] North Carolina State University

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism lets applications continue executing during much of the process migration. This scheme is integrated into an MPI execution environment to transparently tolerate health-related node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 s of prior warning is required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 s. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively. The work also provides a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks. Experiments indicate that the larger the amount of outstanding execution, the greater the benefit of back migration.
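
The "nearly half" figure for checkpoints can be made plausible with a short calculation. The sketch below is an illustration only and rests on an assumption that this record does not state: that the checkpoint interval is chosen with Young's first-order approximation, sqrt(2 * C * MTBF), where C is the cost of one checkpoint. The function names and the sample values for C and MTBF are hypothetical. If a fraction p of faults is handled proactively, only the remaining (1 - p) faults force a rollback, so the mean time between failures seen by reactive FT grows by 1 / (1 - p) and the number of checkpoints shrinks by sqrt(1 - p).

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval: sqrt(2 * C * MTBF).

    Assumption: this model is used here for illustration only.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoints needed with proactive FT, relative to reactive FT alone.

    Faults handled proactively no longer trigger a rollback, so the
    effective MTBF grows by 1 / (1 - p). The interval scales with
    sqrt(MTBF) and the checkpoint count with 1 / interval.
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return math.sqrt(1.0 - proactive_fraction)


# Hypothetical sample values: 60 s per checkpoint, 1 h MTBF.
baseline = young_interval(60.0, 3600.0)
with_proactive = young_interval(60.0, 3600.0 / (1.0 - 0.70))

ratio = checkpoint_ratio(0.70)
print(f"checkpoint count ratio at 70% proactive coverage: {ratio:.2f}")
```

Under these assumptions the ratio at 70% coverage is about 0.55, which is in line with the abstract's statement that the number of checkpoints is nearly cut in half. The ratio does not depend on the sample values chosen for C and MTBF.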

Research Organization: Oak Ridge National Laboratory (ORNL)

Sponsoring Organization: SC USDOE - Office of Science (SC)

DOE Contract Number: AC05-00OR22725

OSTI ID: 1037151

Journal Information: Journal of Parallel and Distributed Computing, Vol. 72, Issue 2; ISSN 0743-7315

Country of Publication: United States

Language: English

Similar Records

Proactive Process-Level Live Migration in HPC Environments

Conference · Mon Dec 31 23:00:00 EST 2007 · OSTI ID:965303

Proactive Fault Tolerance for HPC with Xen Virtualization

Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:978756

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance

Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:931501

Related Subjects

99 GENERAL AND MISCELLANEOUS
FAULT TOLERANT COMPUTERS
INPUT-OUTPUT ANALYSIS
SUPERCOMPUTERS
