OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management

Abstract

Most of today's HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it is also a single point of control as it disables an entire HPC system until repair. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to. They have to be restarted once the head node is up and running again. In this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as a reliability analysis of our proof-of-concept prototype implementation show that continuous availability can be provided by JOSHUA with an acceptable performance trade-off.
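
The reliability analysis mentioned in the abstract is not reproduced in this record. As a rough illustration of why replicating the head node helps, the Python sketch below uses the textbook parallel-redundancy model: the service is considered up as long as at least one of n independent head nodes is up. The function names and the MTTF/MTTR figures are hypothetical and are not taken from the paper, and the independence of node failures is an assumption of the model.

```python
HOURS_PER_YEAR = 365 * 24


def node_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one head node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def service_availability(node_avail: float, n_heads: int) -> float:
    """Availability of n redundant head nodes, assuming independent failures.

    The service is down only if all n nodes are down at the same time.
    """
    if n_heads < 1:
        raise ValueError("at least one head node is required")
    return 1.0 - (1.0 - node_avail) ** n_heads


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical figures for illustration only (not from the paper).
single = node_availability(mttf_hours=5000.0, mttr_hours=72.0)
for n in range(1, 5):
    avail = service_availability(single, n)
    print(f"{n} head node(s): availability={avail:.8f}, "
          f"downtime={annual_downtime_hours(avail):.4f} h/year")
```

Under this model each additional head node multiplies the unavailability by the single-node unavailability, so downtime shrinks geometrically. The model ignores correlated failures and the cost of keeping replicas consistent.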

Authors:
 Uhlemann, Kai [1]; Engelmann, Christian [1]; Scott, Steven L. [1]
  1. ORNL
Publication Date:
2006-01-01
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
OSTI Identifier:
930763
DOE Contract Number:  
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: IEEE International Conference on Cluster Computing (Cluster) 2006, Barcelona, Spain, September 25-28, 2006
Country of Publication:
United States
Language:
English
Subject:
97; 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTER NETWORKS; AVAILABILITY; PERFORMANCE; RELIABILITY; REPAIR; MANAGEMENT; J CODES; SUPPLY DISRUPTION; SERVICE SECTOR

Citation Formats

Uhlemann, Kai, Engelmann, Christian, and Scott, Steven L. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. United States: N. p., 2006. Web.
Uhlemann, Kai, Engelmann, Christian, & Scott, Steven L. (2006). JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. United States.
Uhlemann, Kai, Engelmann, Christian, and Scott, Steven L. 2006. "JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management". United States.
@article{osti_930763,
title = {JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management},
author = {Uhlemann, Kai and Engelmann, Christian and Scott, Steven L},
abstractNote = {Most of today's HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it is also a single point of control as it disables an entire HPC system until repair. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to. They have to be restarted once the head node is up and running again. In this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as a reliability analysis of our proof-of-concept prototype implementation show that continuous availability can be provided by JOSHUA with an acceptable performance trade-off.},
place = {United States},
year = {2006},
month = {1}
}
