U.S. Department of Energy
Office of Scientific and Technical Information

rMPI

Software · OSTI ID:1231486
As high-performance computing (HPC) machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, such as checkpoint-restart, are unsuitable on their own for exascale machines because their overheads are predicted to more than double an application's time to solution. Redundant computation is an alternative mechanism for increasing application reliability beyond what checkpoint-restart alone provides. The rMPI library enables portable and transparent redundant computation that, at extreme scale, has significantly lower overhead than checkpoint-restart on its own.
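As a rough illustration of why redundant computation can help at scale, the following sketch compares the chance that a job survives one interval with and without replicated processes. It is not taken from rMPI: the function names are hypothetical, and it assumes that process failures are independent, that every process has the same failure probability `p_fail` over the interval, and that a replicated job fails only when all replicas of some logical rank fail.

```python
def survival_without_redundancy(p_fail: float, ranks: int) -> float:
    """Probability that none of `ranks` independent processes fails."""
    return (1.0 - p_fail) ** ranks


def survival_with_replicas(p_fail: float, ranks: int, degree: int = 2) -> float:
    """Probability that every logical rank keeps at least one live replica.

    A logical rank is lost only if all `degree` of its replicas fail,
    which happens with probability p_fail ** degree under independence.
    """
    return (1.0 - p_fail ** degree) ** ranks


if __name__ == "__main__":
    p = 1e-4        # assumed per-process failure probability per interval
    n = 100_000     # assumed number of logical ranks
    print("no redundancy :", survival_without_redundancy(p, n))
    print("dual replicas :", survival_with_replicas(p, n, degree=2))
```

Under these assumptions the unreplicated job almost never completes the interval, while the dual-replica job almost always does, at the cost of twice as many processes. Real failure behavior and the overhead of keeping replicas consistent are not modeled here.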
Short Name / Acronym:
RMPI beta version; 002684MLTPL00
Version:
00
Programming Language(s):
Medium: X; OS: Any Unix-based system; Compatibility: Multiplatform
Research Organization:
Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
Sponsoring Organization:
USDOE
Contributing Organization:
Kurt B. Ferreira
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1231486
Country of Origin:
United States

Similar Records

rMPI : increasing fault resiliency in a message-passing environment.
Technical Report · Fri Apr 01 00:00:00 EDT 2011 · OSTI ID:1012733

HPC application fault-tolerance using transparent redundant computation.
Conference · Sat Aug 01 00:00:00 EDT 2009 · OSTI ID:971418

Keeping checkpoint/restart viable for exascale systems.
Technical Report · Thu Sep 01 00:00:00 EDT 2011 · OSTI ID:1029780

Related Subjects