rMPI
Software · OSTI ID:1231486
As high-performance computing (HPC) machines continue to grow in size, fault tolerance and reliability increasingly limit application scalability. Current techniques for ensuring progress across faults, such as checkpoint-restart, are unsuitable on their own for exascale machines because their overheads are predicted to more than double an application's time to solution. Redundant computation is an alternative mechanism for increasing application reliability beyond checkpoint-restart alone. The rMPI library enables portable and transparent redundant computation that, at extreme scale, has significantly lower overhead than checkpoint-restart on its own.
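The idea of redundant computation described above can be illustrated with a small toy model. The sketch below is not rMPI code and does not use MPI. It only shows how an application might see a fixed set of logical ranks while each one is backed by several physical replicas, so that the job keeps running until every replica of some logical rank has failed. The class name `ReplicatedJob`, its methods, the replication degree of 2, and the layout (replica k of logical rank r placed on physical rank r + k * N) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedJob:
    """Toy model: each logical rank is backed by `degree` physical replicas."""
    logical_size: int                          # ranks the application sees
    degree: int = 2                            # hypothetical replication degree
    failed: set = field(default_factory=set)   # physical ranks that have failed

    def replicas(self, logical_rank):
        # Assumed layout (illustrative only): replica k of logical rank r
        # lives on physical rank r + k * logical_size.
        if not 0 <= logical_rank < self.logical_size:
            raise ValueError("logical rank out of range")
        return [logical_rank + k * self.logical_size for k in range(self.degree)]

    def fail(self, physical_rank):
        # Record the failure of one physical rank.
        self.failed.add(physical_rank)

    def send_targets(self, dest_logical_rank):
        # A message addressed to a logical rank goes to every live replica,
        # which is what keeps the replicas' states consistent in this model.
        return [p for p in self.replicas(dest_logical_rank) if p not in self.failed]

    def is_alive(self, logical_rank):
        # A logical rank survives while at least one replica is still running.
        return len(self.send_targets(logical_rank)) > 0

    def job_can_continue(self):
        # The job only needs a restart once some logical rank has no replicas left.
        return all(self.is_alive(r) for r in range(self.logical_size))


job = ReplicatedJob(logical_size=4, degree=2)
job.fail(1)                                  # first replica of logical rank 1 fails
survives_single = job.job_can_continue()     # the other replica (rank 5) remains
job.fail(5)                                  # second replica of logical rank 1 fails
survives_double = job.job_can_continue()     # no replica left for logical rank 1
```

In this model a single node failure does not interrupt the job, which is why redundancy can reduce how often a checkpoint-restart cycle is needed. The cost is that the job occupies `degree` times as many physical ranks.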
- Short Name / Acronym:
- RMPI beta version; 002684MLTPL00
- Version:
- 00
- Programming Language(s):
- Medium: X; OS: Any Unix-based system; Compatibility: Multiplatform
- Research Organization:
- Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
- Sponsoring Organization:
- USDOE
- Contributing Organization:
- Kurt B. Ferreira
- DOE Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1231486
- Country of Origin:
- United States
Similar Records
rMPI : increasing fault resiliency in a message-passing environment.
Technical Report · Fri Apr 01 00:00:00 EDT 2011 · OSTI ID:1231486
HPC application fault-tolerance using transparent redundant computation.
Conference · Sat Aug 01 00:00:00 EDT 2009 · OSTI ID:1231486
Keeping checkpoint/restart viable for exascale systems.
Technical Report · Thu Sep 01 00:00:00 EDT 2011 · OSTI ID:1231486