U.S. Department of Energy
Office of Scientific and Technical Information

Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems

Conference · OSTI ID: 989650

This paper presents various aspects of reliability, availability, and serviceability (RAS) systems as they relate to group communication services, including reliable and totally ordered multicast/broadcast, virtual synchrony, and failure detection. Although availability, particularly high availability using replication-based architectures, has recently attracted a surge of research interest, much remains to be done to understand the basic underlying concepts for achieving RAS systems, especially in the high-end and high-performance computing (HPC) communities. The paper discusses various attributes of group communication services and a prototype of symmetric active replication that follows ideas used in the Newtop protocol. We explore the application of group communication services to RAS in HPC, laying the groundwork for an integrated model.

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
ORNL LDRD Director's R&D
DOE Contract Number:
AC05-00OR22725
OSTI ID:
989650
Country of Publication:
United States
Language:
English

Similar Records

Symmetric Active/Active High Availability for High-Performance Computing System Services
Journal Article · Sat Dec 31 23:00:00 EST 2005 · Journal of Computers · OSTI ID:978718

Symmetric Active/Active Metadata Service for High Availability Parallel File Systems
Journal Article · Wed Dec 31 23:00:00 EST 2008 · Journal of Parallel and Distributed Computing · OSTI ID:979283

The intergroup protocols: Scalable group communication for the internet
Thesis/Dissertation · Sun Dec 03 23:00:00 EST 2000 · OSTI ID:775165
