OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Active/Active Replication for Highly Available HPC System Services

Abstract

High performance computing (HPC) exploits multi-processor parallelism on a large scale in order to enable research in computational sciences in various areas, such as nanotechnology, quantum chemistry, nuclear fusion and astrophysics. Simulations of real-world problems using mathematical abstraction models allow scientists to gain knowledge without the need or the capability to perform physical experiments. Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system as they render it inaccessible and unmanageable in case of a failure until repair, causing a significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease-of-use and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research.

Authors:
 Engelmann, Christian [1]; Scott, Steven L [1]; Leangsuksun, Chokchai [1]; He, X. [2]
  1. ORNL
  2. Tennessee Technological University
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Center for Computational Sciences
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1003410
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: International Symposium on Frontiers in Availability, Reliability and Security (FARES) 2006, Vienna, Austria, April 20-22, 2006
Country of Publication:
United States
Language:
English
Subject:
79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY; ALGORITHMS; ASTROPHYSICS; AVAILABILITY; CHEMISTRY; PERFORMANCE; RELIABILITY; REPAIR; SECURITY; TORQUE

Citation Formats

Engelmann, Christian, Scott, Steven L, Leangsuksun, Chokchai, and He, X. Active/Active Replication for Highly Available HPC System Services. United States: N. p., 2006. Web.
Engelmann, Christian, Scott, Steven L, Leangsuksun, Chokchai, & He, X. Active/Active Replication for Highly Available HPC System Services. United States.
Engelmann, Christian, Scott, Steven L, Leangsuksun, Chokchai, and He, X. 2006. "Active/Active Replication for Highly Available HPC System Services". United States.
@article{osti_1003410,
title = {Active/Active Replication for Highly Available HPC System Services},
author = {Engelmann, Christian and Scott, Steven L and Leangsuksun, Chokchai and He, X.},
abstractNote = {High performance computing (HPC) exploits multi-processor parallelism on a large scale in order to enable research in computational sciences in various areas, such as nanotechnology, quantum chemistry, nuclear fusion and astrophysics. Simulations of real-world problems using mathematical abstraction models allow scientists to gain knowledge without the need or the capability to perform physical experiments. Today's high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system as they render it inaccessible and unmanageable in case of a failure until repair, causing a significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease-of-use and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research.},
place = {United States},
year = {2006}
}


Similar Records:
  • Most of today's HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it is also a single point of control as it disables an entire HPC system until repair. One of the most important HPC system services running on the head node is job and resource management. If it goes down, all currently running jobs lose the service they report back to. They have to be restarted once the head node is up and running again. With this paper, we present a generic approach for providing symmetric active/active replication for highly available HPC job and resource management. The JOSHUA solution provides a virtually synchronous environment for continuous availability without any interruption of service and without any loss of state. Replication is performed externally via the PBS service interface without the need to modify any service code. Test results as well as a reliability analysis of our proof-of-concept prototype implementation show that continuous availability can be provided by JOSHUA with an acceptable performance trade-off.
  • During the last several years, we have established the symmetric active/active replication model for service-level high availability and implemented several proof-of-concept prototypes. One major deficiency of our model is its inability to deal with dependent services, since its original architecture is based on the client-service model. This paper extends our model to dependent services using its already existing mechanisms and features. The presented concept is based on the idea that a service may also be a client of another service, and multiple services may be clients of each other. A high-level abstraction is used to illustrate dependencies between clients and services, and to decompose dependencies between services into respective client-service dependencies. This abstraction may be used for providing high availability in distributed computing systems with complex service-oriented architectures.
  • This paper summarizes our efforts over the last 3-4 years in providing symmetric active/active high availability for high-performance computing (HPC) system services. This work paves the way for high-level reliability, availability and serviceability in extreme-scale HPC systems by focusing on the most critical components, head and service nodes, and by reinforcing them with appropriate high availability solutions. This paper presents our accomplishments in the form of concepts and respective prototypes, discusses existing limitations, outlines possible future work, and describes the relevance of this research to other, planned efforts.
  • Jefferson Lab has implemented a scalable, distributed, high performance mass storage system, JASMine. The system is entirely implemented in Java, provides access to robotic tape storage and includes disk cache and stage manager components. The disk manager subsystem may be used independently to manage stand-alone disk pools. The system includes a scheduler to provide policy-based access to the storage systems. Security is provided by pluggable authentication modules and is implemented at the network socket level. The tape and disk cache systems have well defined interfaces in order to provide integration with grid-based services. The system is in production and being used to archive 1 TB per day from the experiments, and currently moves over 2 TB per day total. This paper will describe the architecture of JASMine, discuss the rationale for building the system, and present a transparent third-party file replication service to move data to collaborating institutes using JASMine, XML, and servlet technology interfacing to grid-based file transfer mechanisms.