Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband
Checkpoint-restart is one of the most widely used software approaches to achieving fault tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable, transparent system-level solution to address fault tolerance for applications based on global address space (GAS) programming models on Infiniband clusters. In addition to handling communication, the solution addresses transparent checkpointing of user-generated files. We exploit the support for the Infiniband network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library capable of suspending and resuming applications. We present efficient and scalable mechanisms to distribute checkpoint requests and to back up virtual machine memory images and file systems. We tested our approach in the context of NWChem, a popular computational chemistry suite. We demonstrated that NWChem can be executed, without any modification to the source code, on a virtualized 8-node cluster with very little overhead (below 3%). We observe that the total checkpoint time is limited by disk I/O. Finally, we measured the system-size-dependent components of the checkpoint time on up to 1024 cores (128 nodes), demonstrating the scalability of our approach in medium- to large-scale systems.
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 958496
- Report Number(s): PNNL-SA-64617; KJ0402000
- Country of Publication: United States
- Language: English
Similar Records
Transparent System-level Migration of PGAS Applications using Xen on Infiniband
Conference · Mon Jun 18 00:00:00 EDT 2007 · OSTI ID: 947501
Checkpoint/Restart of Virtual Machines Based on Xen
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID: 931386
An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand
Conference · Tue Sep 01 00:00:00 EDT 2009 · OSTI ID: 986727