U.S. Department of Energy
Office of Scientific and Technical Information

Symmetric Active/Active High Availability for High-Performance Computing System Services

Journal Article · Journal of Computers
 [1];  [1];  [2];  [3]
  1. ORNL
  2. Louisiana Tech University
  3. Tennessee Technological University

This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system, as a failure renders the system inaccessible and unmanageable until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony, utilizing an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.
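
The core idea in the abstract, replicas that stay consistent because every one of them applies the same commands in the same total order, can be illustrated with a minimal sketch. This is not the prototype described in the article: the `Sequencer` class is a single-process stand-in for a process group communication system, and `HeadNode`, the `submit`/`cancel` commands, and the job names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sequencer:
    """Hypothetical stand-in for a group communication system.

    It only models one property: all commands end up in a single,
    agreed total order.
    """
    log: list = field(default_factory=list)

    def broadcast(self, command):
        self.log.append(command)
        return len(self.log) - 1  # global sequence number of the command


@dataclass
class HeadNode:
    """Hypothetical replica of a job-queue service on one head node."""
    name: str
    queue: list = field(default_factory=list)
    applied: int = 0  # number of log entries applied so far

    def catch_up(self, log):
        # Apply every not-yet-seen command in the agreed order.
        for op, job in log[self.applied:]:
            if op == "submit":
                self.queue.append(job)
            elif op == "cancel" and job in self.queue:
                self.queue.remove(job)
        self.applied = len(log)


def run_demo():
    bus = Sequencer()
    nodes = [HeadNode("head-a"), HeadNode("head-b"), HeadNode("head-c")]

    bus.broadcast(("submit", "job-1"))
    bus.broadcast(("submit", "job-2"))
    for node in nodes:
        node.catch_up(bus.log)

    # Simulate the loss of one head node. The survivors already hold
    # the full service state, so no failover step is modeled here.
    survivors = nodes[:2]

    bus.broadcast(("cancel", "job-1"))
    bus.broadcast(("submit", "job-3"))
    for node in survivors:
        node.catch_up(bus.log)

    return [node.queue for node in survivors]


if __name__ == "__main__":
    print(run_demo())
```

Because each replica applies the same deterministic commands in the same order, the surviving nodes hold identical queues after one node is lost. A real system would also need membership changes, state transfer for joining nodes, and output handling, none of which this sketch attempts.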

Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
ORNL LDRD Director's R&D; SC USDOE - Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
978718
Journal Information:
Journal Name: Journal of Computers; Journal Volume: 1; Journal Issue: 8
Country of Publication:
United States
Language:
English

Similar Records

Active/Active Replication for Highly Available HPC System Services
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID:1003410

Symmetric Active/Active Metadata Service for High Availability Parallel File Systems
Journal Article · Wed Dec 31 23:00:00 EST 2008 · Journal of Parallel and Distributed Computing · OSTI ID:979283

JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management
Conference · Sat Dec 31 23:00:00 EST 2005 · OSTI ID:930763