Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems
- ORNL
This paper presents various aspects of reliability, availability and serviceability (RAS) systems as they relate to group communication services, including reliable and totally ordered multicast/broadcast, virtual synchrony, and failure detection. While availability, particularly high availability using replication-based architectures, has recently attracted a surge of research interest, much remains to be done in understanding the basic underlying concepts for achieving RAS systems, especially in the high-end and high-performance computing (HPC) communities. We discuss various attributes of group communication services and a prototype of symmetric active replication that follows ideas used in the Newtop protocol. We explore the application of group communication services to RAS for HPC, laying the groundwork for an integrated model.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Laboratory Directed Research and Development (LDRD) Program
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 989650
- Resource Relation:
- Conference: 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006
- Country of Publication:
- United States
- Language:
- English
Similar Records
Symmetric Active/Active Metadata Service for High Availability Parallel File Systems
Active/Active Replication for Highly Available HPC System Services