OSTI.GOV
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Proactive Fault Tolerance for HPC with Xen Virtualization

Abstract

Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from “unhealthy” nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
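The abstract describes a daemon that couples node health monitoring with load-based target selection and Xen live migration. As a rough, hypothetical sketch of that control loop (not the authors' implementation), the Python below assumes health is read via an ipmitool-style sensor query, load via /proc/loadavg over SSH, and migration via Xen's classic `xm migrate --live`; all host names, sensor names, thresholds, and the guest domain name are placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a proactive FT control loop: poll node health, and
when a threshold is crossed, pick a lightly loaded healthy node and
live-migrate the Xen guest (with its MPI task) to it."""

import subprocess
import time

GUEST_DOMAIN = "mpi-guest"              # hypothetical Xen domU hosting the MPI task
CANDIDATE_NODES = ["node02", "node03"]  # hypothetical migration targets
TEMP_THRESHOLD_C = 70.0                 # hypothetical "deteriorating health" limit
POLL_INTERVAL_S = 10


def local_cpu_temperature() -> float:
    """Read a CPU temperature sensor (here via ipmitool, assumed installed)."""
    out = subprocess.run(
        ["ipmitool", "sensor", "reading", "CPU Temp"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Expected form "CPU Temp | 54.000"; take the numeric field.
    return float(out.split("|")[1].strip())


def remote_load(node: str) -> float:
    """Fetch the 1-minute load average of a candidate node over SSH."""
    out = subprocess.run(
        ["ssh", node, "cat", "/proc/loadavg"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.split()[0])


def main() -> None:
    while True:
        if local_cpu_temperature() > TEMP_THRESHOLD_C:
            # Load-based selection: the least-loaded candidate becomes the target.
            target = min(CANDIDATE_NODES, key=remote_load)
            # Live migration keeps the guest running during most of the transfer;
            # only the short stop-and-copy phase pauses the MPI task.
            subprocess.run(["xm", "migrate", "--live", GUEST_DOMAIN, target],
                           check=True)
            break  # this node is being vacated, so stop monitoring here
        time.sleep(POLL_INTERVAL_S)


if __name__ == "__main__":
    main()
```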

Authors:
 Nagarajan, Arun Babu [1]; Mueller, Frank [1]; Engelmann, Christian [2]; Scott, Stephen L. [2]
  1. North Carolina State University
  2. ORNL
Publication Date:
2007
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
978756
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 21st ACM International Conference on Supercomputing (ICS 2007), Seattle, WA, USA, June 16-20, 2007
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; MONITORING; TOLERANCE; SUPERCOMPUTERS; PARALLEL PROCESSING

Citation Formats

Nagarajan, Arun Babu, Mueller, Frank, Engelmann, Christian, and Scott, Stephen L. Proactive Fault Tolerance for HPC with Xen Virtualization. United States: N. p., 2007. Web.
Nagarajan, Arun Babu, Mueller, Frank, Engelmann, Christian, & Scott, Stephen L. Proactive Fault Tolerance for HPC with Xen Virtualization. United States.
Nagarajan, Arun Babu, Mueller, Frank, Engelmann, Christian, and Scott, Stephen L. 2007. "Proactive Fault Tolerance for HPC with Xen Virtualization". United States.
@article{osti_978756,
title = {Proactive Fault Tolerance for HPC with Xen Virtualization},
author = {Nagarajan, Arun Babu and Mueller, Frank and Engelmann, Christian and Scott, Stephen L.},
abstractNote = {Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from “unhealthy” nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.},
place = {United States},
year = {2007}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. It further relates prior work to the presented architecture and classification, and discusses the challenges ahead for the needed supporting technologies.
  • New virtualization solutions such as Xen allow users to execute hundreds of virtual machines on a single physical machine. The value of these solutions has been proven for system isolation and security features, especially for Internet Service Providers (ISPs), as well as for high-performance computing. A natural question is whether it is possible to use all these virtual machines at the same time, creating a virtual cluster. This might be an interesting solution for the development and experimentation of cluster applications. This document presents an extension of OSCAR for the deployment and management of Xen virtual machines. We also analyze the value of virtualization for developing, testing, and experimenting with applications for clusters, in particular with the use of a fully virtualized cluster.
  • As the core counts of HPC machines continue to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed. (A hedged sketch of rank-level redundancy follows this list.)
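The third item above summarizes a redundancy-based approach to reliability. As a loose, hypothetical illustration of the rank-level redundancy idea (not that paper's system), the mpi4py sketch below computes each logical rank twice, once by a primary and once by a shadow replica; the communicator split, the toy workload, and the replica layout are assumptions.

```python
"""Rank-level redundancy sketch with mpi4py (assumed installed).
Run with an even number of ranks, e.g.: mpirun -np 4 python redundancy_sketch.py"""

from mpi4py import MPI

world = MPI.COMM_WORLD
assert world.Get_size() % 2 == 0, "this sketch needs an even number of ranks"

n_logical = world.Get_size() // 2          # half the ranks serve as shadow replicas
logical_rank = world.Get_rank() % n_logical
is_shadow = world.Get_rank() >= n_logical

# Primary and shadow perform identical work for the same logical rank, so
# either copy can supply the result if the other's node fails.
local = sum(i * i for i in range(logical_rank * 1000, (logical_rank + 1) * 1000))

# One communicator per replica set: the collective runs twice, once among
# primaries and once among shadows, producing two identical copies of the result.
replica_comm = world.Split(color=int(is_shadow), key=logical_rank)
total = replica_comm.allreduce(local, op=MPI.SUM)

if logical_rank == 0:
    role = "shadow" if is_shadow else "primary"
    print(f"{role} replica set computed {total}")
```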