Title: Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband

Abstract

Checkpoint-Restart is one of the most widely used software approaches to achieving fault tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable, transparent system-level solution that addresses fault tolerance for applications based on global address space (GAS) programming models on Infiniband clusters. In addition to handling communication, the solution addresses transparent checkpointing of user-generated files. We exploit the support for the Infiniband network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library capable of suspending and resuming applications. We present efficient and scalable mechanisms to distribute checkpoint requests and to back up virtual machines' memory images and file systems. We tested our approach in the context of NWChem, a popular computational chemistry suite. We demonstrated that NWChem can be executed, without any modification to the source code, on a virtualized 8-node cluster with very little overhead (below 3%). We observe that the total checkpoint time is limited by disk I/O. Finally, we measured the system-size-dependent components of the checkpoint time on up to 1024 cores (128 nodes), demonstrating the scalability of our approach in medium/large-scale systems.

Authors:
Villa, Oreste; Krishnamoorthy, Sriram; Nieplocha, Jaroslaw; Brown, David ML
Publication Date:
May 2009
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
958496
Report Number(s):
PNNL-SA-64617
KJ0402000; TRN: US201002%%60
DOE Contract Number:  
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 6th ACM Conference on Computing Frontiers, 197-206
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICAL METHODS AND COMPUTING; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; C CODES; ERRORS; COMPUTERS; START-UP; DATA TRANSMISSION; COMPUTER CALCULATIONS; ON-LINE SYSTEMS; Scalable; Checkpoint-Restart; Global Address Space; Virtual Machines; Infiniband

Citation Formats

Villa, Oreste, Krishnamoorthy, Sriram, Nieplocha, Jaroslaw, and Brown, David ML. Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband. United States: N. p., 2009. Web. doi:10.1145/1531743.1531776.
Villa, Oreste, Krishnamoorthy, Sriram, Nieplocha, Jaroslaw, & Brown, David ML. Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband. United States. doi:10.1145/1531743.1531776.
Villa, Oreste, Krishnamoorthy, Sriram, Nieplocha, Jaroslaw, and Brown, David ML. 2009. "Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband". United States. doi:10.1145/1531743.1531776.
@inproceedings{osti_958496,
title = {Scalable Transparent Checkpoint-Restart of Global Address Space Applications on Virtual Machines over Infiniband},
author = {Villa, Oreste and Krishnamoorthy, Sriram and Nieplocha, Jaroslaw and Brown, David ML},
abstractNote = {Checkpoint-Restart is one of the most widely used software approaches to achieving fault tolerance in high-end clusters. While standard techniques typically focus on user-level solutions, the advent of virtualization software has enabled efficient and transparent system-level approaches. In this paper, we present a scalable, transparent system-level solution that addresses fault tolerance for applications based on global address space (GAS) programming models on Infiniband clusters. In addition to handling communication, the solution addresses transparent checkpointing of user-generated files. We exploit the support for the Infiniband network in the Xen virtual machine environment. We have developed a version of the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication library capable of suspending and resuming applications. We present efficient and scalable mechanisms to distribute checkpoint requests and to back up virtual machines' memory images and file systems. We tested our approach in the context of NWChem, a popular computational chemistry suite. We demonstrated that NWChem can be executed, without any modification to the source code, on a virtualized 8-node cluster with very little overhead (below 3%). We observe that the total checkpoint time is limited by disk I/O. Finally, we measured the system-size-dependent components of the checkpoint time on up to 1024 cores (128 nodes), demonstrating the scalability of our approach in medium/large-scale systems.},
doi = {10.1145/1531743.1531776},
booktitle = {Proceedings of the 6th ACM Conference on Computing Frontiers},
pages = {197--206},
place = {United States},
year = {2009},
month = {5}
}

