A case for Virtual Machine based Fault Injection in a High-Performance Computing Environment

Vallee, Geoffroy R; Engelmann, Christian; Scott, Stephen L

Conference · January 1, 2011

OSTI ID: 1037028

Author affiliation (all authors): ORNL

Large-scale computing platforms provide tremendous capabilities for scientific discovery. These systems have hundreds of thousands of computing cores, hundreds of terabytes of memory, and enormous high-performance interconnection networks, and they face major challenges in achieving performance at such scale. Failures are an Achilles heel of these systems. As applications and system software scale up to multi-petaflop and, beyond that, exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance (FT) and resilience research for HPC systems, including work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault-tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques. While there is much work to be done on the FT/resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chains for these systems are often tailored to the platform, and while the majority of systems on the Top500 Supercomputer list run Linux, these operating environments typically contain many site- or machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize interruption for end users (scientists). The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized. Virtualization also provides a clean abstraction for building experimental tools for investigating the effects of failures in HPC and the related research on FT/resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context, and outline an approach that can leverage virtualization.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1037028

Resource Relation: Conference: EuroPar, Bordeaux, France, August 29, 2011

Country of Publication: United States

Language: English

Similar Records

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools

Conference · January 1, 2013 · OSTI ID:1037028

Vallee, Geoffroy R; Boehm, Swen; Engelmann, Christian

A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools

Conference · January 30, 2014 · 2013 First International Symposium on Computing and Networking (CANDAR) · OSTI ID:1037028

Vallee, Geoffroy; Naughton, Thomas; Bohm, Swen; +1 more

HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors

Technical Report · January 31, 2007 · OSTI ID:1037028

Jones, T; Kale, L; Moreira, J; +4 more

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
ALGORITHMS
PERFORMANCE
SUPERCOMPUTERS
TOLERANCE
Fault injection
fault tolerance
virtualization
