
Title: Symmetric Active/Active High Availability for High-Performance Computing System Services

Abstract

This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system as they render it inaccessible and unmanageable in case of a failure until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.

Authors:
Engelmann, Christian [1]; Scott, Stephen L [1]; Leangsuksun, Chokchai [2]; He, X. [3]
  1. ORNL
  2. Louisiana Tech University
  3. Tennessee Technological University
Publication Date:
2006
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
OSTI Identifier:
978718
DOE Contract Number:
DE-AC05-00OR22725
Resource Type:
Journal Article
Resource Relation:
Journal Name: Journal of Computers; Journal Volume: 1; Journal Issue: 8
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; AVAILABILITY; COMMUNICATIONS; COMPUTERS; FOCUSING; IMPLEMENTATION; MANAGEMENT; PERFORMANCE; REDUNDANCY; REPAIR; RESOURCE MANAGEMENT

Citation Formats

Engelmann, Christian, Scott, Stephen L, Leangsuksun, Chokchai, and He, X. Symmetric Active/Active High Availability for High-Performance Computing System Services. United States: N. p., 2006. Web. doi:10.4304/jcp.1.8.43-54.
Engelmann, Christian, Scott, Stephen L, Leangsuksun, Chokchai, & He, X. Symmetric Active/Active High Availability for High-Performance Computing System Services. United States. doi:10.4304/jcp.1.8.43-54.
Engelmann, Christian, Scott, Stephen L, Leangsuksun, Chokchai, and He, X. 2006. "Symmetric Active/Active High Availability for High-Performance Computing System Services". United States. doi:10.4304/jcp.1.8.43-54.
@article{osti_978718,
title = {Symmetric Active/Active High Availability for High-Performance Computing System Services},
author = {Engelmann, Christian and Scott, Stephen L and Leangsuksun, Chokchai and He, X.},
abstractNote = {This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system as they render it inaccessible and unmanageable in case of a failure until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.},
doi = {10.4304/jcp.1.8.43-54},
journal = {Journal of Computers},
number = 8,
volume = 1,
place = {United States},
year = {2006},
month = jan
}