
Title: Transparent System-level Migration of PGAS Applications using Xen on Infiniband

Conference

Abstract: Checkpoint-restart is considered one of the most natural approaches to achieving fault tolerance in a high-performance cluster. While early research focused on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMware, has opened the door to efficient and scalable cluster-level virtualization. In this paper we present an innovative approach to cluster fault tolerance that integrates Xen virtualization with the latest generation of the Infiniband network. A major contribution of this paper is the automatic identification of global recovery lines to freeze the state of the machine. Our focus is on the partitioned global address space (PGAS) programming model. PGAS models have been receiving an increasing amount of attention in recent years. We have developed global coordination mechanisms and deployed them in the ARMCI one-sided communication library, which has been used as a run-time system for several PGAS models. The experimental results show that it is possible to virtualize the communication and the computation with minimal overhead and to provide seamless migration capabilities.

Research Organization:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
947501
Report Number(s):
PNNL-SA-55723; KJ0402000; TRN: US200909%%109
Resource Relation:
Conference: 2007 IEEE International Conference on Cluster Computing, 74-83
Country of Publication:
United States
Language:
English