Transparent System-level Migration of PGAS Applications using Xen on Infiniband
Abstract: Checkpoint-restart is considered one of the most natural approaches to achieving fault tolerance in a high-performance cluster. While early research focused on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMware, has opened the door to efficient and scalable cluster-level virtualization. In this paper we present an innovative approach to cluster fault tolerance that integrates Xen virtualization with the latest generation of the Infiniband network. A major contribution of this paper is the automatic identification of global recovery lines to freeze the status of the machine. Our focus is on the partitioned global address space (PGAS) programming model. PGAS models have been receiving an increasing amount of attention in recent years. We have developed global coordination mechanisms and deployed them in the ARMCI one-sided communication library, which has been used as a run-time system for several PGAS models. The experimental results show that it is possible to virtualize the communication and the computation with minimal overhead and to provide seamless migration capabilities.
- Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 947501
- Report Number(s): PNNL-SA-55723; KJ0402000; TRN: US200909%%109
- Resource Relation: Conference: 2007 IEEE International Conference on Cluster Computing, 74-83
- Country of Publication: United States
- Language: English
Similar Records
Efficient On-demand Connection Management Mechanisms with PGAS Models on InfiniBand
Dynamic Time-Variant Connection Management for PGAS Models on InfiniBand