Active/Active Replication for Highly Available HPC System Services

Engelmann, Christian; Scott, Steven L; Leangsuksun, Chokchai; He, X.

Conference · Sun Jan 01 00:00:00 EST 2006

OSTI ID:1003410

Engelmann, Christian [1]; Scott, Steven L [1]; Leangsuksun, Chokchai [1]; He, X. [2]

[1] ORNL
[2] Tennessee Technological University

High-performance computing (HPC) exploits multi-processor parallelism on a large scale to enable research in computational sciences such as nanotechnology, quantum chemistry, nuclear fusion and astrophysics. Simulations of real-world problems using mathematical abstraction models allow scientists to gain knowledge without the need or the capability to perform physical experiments. Today's HPC systems have several reliability deficiencies that result in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system: when one fails, the system is inaccessible and unmanageable until it is repaired, causing significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It compares both methods in terms of expected correctness, ease of use and performance, based on early results from ongoing work on symmetric active/active high availability for two HPC system services (TORQUE and the PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. The paper concludes with an overview of past and ongoing work and a short summary of the presented research.

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 1003410

Resource Relation: Conference: International Symposium on Frontiers in Availability, Reliability and Security (FARES) 2006, Vienna, Austria, April 20-22, 2006

Country of Publication: United States

Language: English

Similar Records

Symmetric Active/Active High Availability for High-Performance Computing System Services

Journal Article · Sun Jan 01 00:00:00 EST 2006 · Journal of Computers · OSTI ID:1003410

Engelmann, Christian; Scott, Stephen L; Leangsuksun, Chokchai; +1 more

JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management

Conference · Sun Jan 01 00:00:00 EST 2006 · OSTI ID:1003410

Uhlemann, Kai; Engelmann, Christian; Scott, Steven L

Symmetric Active/Active Metadata Service for High Availability Parallel File Systems

Journal Article · Thu Jan 01 00:00:00 EST 2009 · Journal of Parallel and Distributed Computing · OSTI ID:1003410

He, X.; Ou, Li; Engelmann, Christian; +2 more

Related Subjects

79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY
ALGORITHMS
ASTROPHYSICS
AVAILABILITY
CHEMISTRY
PERFORMANCE
RELIABILITY
REPAIR
SECURITY
TORQUE
