Title: Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations

Conference
OSTI ID: 931290
 [1];  [1];  [2];  [3]
  1. ORNL
  2. Louisiana Tech University
  3. Tennessee Technological University

During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University have focused on efficient redundancy strategies for the head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, such as job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, that must be addressed before production-type deployment.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE Office of Science (SC)
DOE Contract Number:
DE-AC05-00OR22725
Resource Relation:
Conference: High Availability and Performance Workshop (HAPCW) 2006, Santa Fe, NM, USA, October 17, 2006
Country of Publication:
United States
Language:
English