U.S. Department of Energy
Office of Scientific and Technical Information

Supporting the Development of Resilient Message Passing Applications using Simulation

Conference
OSTI ID:1131524
An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. This paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of Message Passing Interface (MPI) applications on future HPC architectures, with the fault-tolerant MPI extensions proposed by the MPI Fault Tolerance Working Group. xSim permits running MPI applications with millions of concurrent MPI ranks while observing application performance in a simulated extreme-scale system using a lightweight parallel discrete event simulation. The newly added features offer user-level failure mitigation (ULFM) extensions at the simulated MPI layer to support algorithm-based fault tolerance (ABFT). The presented solution permits investigating both the performance under failure and the failure handling of ABFT solutions. The enhanced xSim is the first performance tool that supports ULFM and ABFT.
Research Organization:
Oak Ridge National Laboratory (ORNL)
Sponsoring Organization:
ORNL LDRD Director's R&D
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1131524
Country of Publication:
United States
Language:
English

Similar Records

Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation
Conference · Thu Dec 31 23:00:00 EST 2015 · OSTI ID:1241477

Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems
Conference · Mon Dec 31 23:00:00 EST 2012 · OSTI ID:1107829

xSim: The Extreme-Scale Simulator
Conference · Fri Dec 31 23:00:00 EST 2010 · OSTI ID:1023315
