U.S. Department of Energy
Office of Scientific and Technical Information

rMPI : increasing fault resiliency in a message-passing environment.

Technical Report
DOI: https://doi.org/10.2172/1012733 · OSTI ID: 1012733
As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale and show the scenarios in which redundant computation makes sense.
Research Organization:
Sandia National Laboratories
Sponsoring Organization:
USDOE
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1012733
Report Number(s):
SAND2011-2488
Country of Publication:
United States
Language:
English

Similar Records

Increasing fault resiliency in a message-passing environment.
Technical Report · October 1, 2009 · OSTI ID: 1001015

rMPI
Software · August 24, 2010 · OSTI ID: 1231486

HPC application fault-tolerance using transparent redundant computation.
Conference · August 1, 2009 · OSTI ID: 971418