U.S. Department of Energy
Office of Scientific and Technical Information

Proactive Process-Level Live Migration in HPC Environments

Conference
OSTI ID:965303

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently tolerate health-induced node failures, which eliminates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
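
The abstract does not say how the checkpoint reduction is derived. One plausible reading, offered here only as an assumption, is Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF), where C is the cost of one checkpoint and MTBF is the mean time between failures that reactive FT must still handle. If a fraction p of faults is handled proactively, that MTBF grows by 1 / (1 - p), so the interval grows by sqrt(1 / (1 - p)) and the number of checkpoints shrinks by sqrt(1 - p). The function names and the sample values for C and MTBF below are hypothetical and chosen purely for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval, in seconds.

    Assumption: this formula is not quoted in the abstract; it is used
    here only as one plausible model of the reported result.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Checkpoint count relative to a purely reactive setup.

    proactive_fraction is the share of faults handled by migration
    instead of checkpoint/restart (0 <= p < 1).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    # The remaining faults arrive 1 / (1 - p) times less often, so the
    # interval scales by sqrt(1 / (1 - p)) and the count by sqrt(1 - p).
    return math.sqrt(1.0 - proactive_fraction)


if __name__ == "__main__":
    cost_s = 60.0          # hypothetical checkpoint cost
    mtbf_s = 6.0 * 3600.0  # hypothetical MTBF without proactive FT
    p = 0.7                # share of faults handled proactively (from the abstract)

    base = young_interval(cost_s, mtbf_s)
    improved = young_interval(cost_s, mtbf_s / (1.0 - p))
    print(f"interval: {base:.0f} s -> {improved:.0f} s")
    print(f"relative checkpoint count: {checkpoint_ratio(p):.3f}")  # 0.548
```

Under this model, p = 0.7 leaves about 55% of the original checkpoints, which is in line with "nearly cutting the number of checkpoints in half". The ratio does not depend on the chosen checkpoint cost or MTBF, because both cancel out.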

Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
SC USDOE - Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
965303
Country of Publication:
United States
Language:
English

Similar Records

Proactive Process-Level Live Migration and Back Migration in HPC Environments
Journal Article · Sat Dec 31 23:00:00 EST 2011 · Journal of Parallel and Distributed Computing · OSTI ID:1037151

Proactive Fault Tolerance for HPC with Xen Virtualization
Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:978756

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
Conference · Sun Dec 31 23:00:00 EST 2006 · OSTI ID:931501