The Case for Modular Redundancy in Large-Scale High Performance Computing Systems

Engelmann, Christian; Ong, Hong Hoe; Scott, Stephen L

Conference · 2009

OSTI ID:979151

Engelmann, Christian [1]; Ong, Hong Hoe [1]; Scott, Stephen L [1]

[1] ORNL

Recent investigations into the resilience of large-scale high-performance computing (HPC) systems have shown a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean time to failure (MTTF) and a higher mean time to recover (MTTR) than their predecessors. Modular redundancy is used today in many mission-critical systems, such as aerospace and command and control systems, to provide resilience. The primary argument against modular redundancy for resilience in HPC has always been that the capability of an HPC system, and the respective return on investment, would be significantly reduced. We argue that modular redundancy can significantly increase compute node availability, as it removes the impact of scale from single compute node MTTR. We further argue that single compute nodes can be much less reliable, and therefore less expensive, and still be highly available if their MTTR/MTTF ratio is maintained.
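
The MTTR/MTTF argument in the abstract can be illustrated with the standard steady-state availability formulas. The sketch below is not taken from the paper: it assumes independent node failures, parallel redundancy in which any one working replica is sufficient, and a system that is up only when every logical node is up. All function names and numbers are hypothetical and chosen only for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def redundant_availability(node_availability: float, replicas: int) -> float:
    """Availability of a replica group that is up if at least one replica is up.

    Assumes replica failures are independent (an assumption, not a claim
    from the source).
    """
    return 1.0 - (1.0 - node_availability) ** replicas


def system_availability(logical_node_availability: float, nodes: int) -> float:
    """Availability of a system that needs all of its logical nodes up."""
    return logical_node_availability ** nodes


# Hypothetical numbers: a reliable node and a node that fails ten times
# as often but also recovers ten times faster (same MTTR/MTTF ratio).
reliable_node = availability(mttf_hours=5000.0, mttr_hours=50.0)
cheap_node = availability(mttf_hours=500.0, mttr_hours=5.0)

# Dual modular redundancy applied to the cheaper node.
cheap_dual = redundant_availability(cheap_node, replicas=2)

# Effect of scale with and without redundancy (1000 logical nodes).
plain_system = system_availability(cheap_node, nodes=1000)
redundant_system = system_availability(cheap_dual, nodes=1000)

print(f"reliable node:      {reliable_node:.6f}")
print(f"cheap node:         {cheap_node:.6f}")
print(f"cheap node, dual:   {cheap_dual:.6f}")
print(f"1000 nodes, plain:  {plain_system:.6f}")
print(f"1000 nodes, dual:   {redundant_system:.6f}")
```

Under these assumptions the two single nodes have the same availability because their MTTR/MTTF ratio is the same, and replicating each node raises per-node availability enough that system availability degrades far more slowly as the node count grows.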

Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Office of Science (SC)

DOE Contract Number: DE-AC05-00OR22725

OSTI ID: 979151

Resource Relation: Conference: 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, Innsbruck, Austria, February 16-18, 2009

Country of Publication: United States

Language: English

Similar Records

Redundant Execution of HPC Applications with MR-MPI

Conference · 2011 · OSTI ID:979151

Engelmann, Christian; Boehm, Swen

Rolex: Resilience-oriented language extensions for extreme-scale systems

Journal Article · May 26, 2016 · Journal of Supercomputing · OSTI ID:979151

Lucas, Robert F.; Hukerikar, Saurabh

File I/O for MPI Applications in Redundant Execution Scenarios

Conference · 2012 · OSTI ID:979151

Boehm, Swen; Engelmann, Christian

Related Subjects

99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE
AVAILABILITY
CONTROL SYSTEMS
PERFORMANCE
REDUNDANCY
RELIABILITY
high-performance computing
modular redundancy
fault tolerance
high availability
reliability
