U.S. Department of Energy
Office of Scientific and Technical Information

Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation

Conference · OSTI ID:1241477
Radiation-induced bit flip faults are of particular concern in extreme-scale high-performance computing systems. This paper presents a simulation-based tool that enables the development of soft-error resilient message passing applications by permitting the investigation of their correctness and performance under various fault conditions. The documented extensions to the Extreme-scale Simulator (xSim) enable the injection of bit flip faults at specific injection location(s) and fault activation time(s), while supporting a significant degree of configurability of the fault type. Experiments show that the simulation overhead with the new feature is ~2,325% for serial execution and ~1,730% at 128 MPI processes, both with very fine-grain fault injection. Fault injection experiments demonstrate the usefulness of the new feature by injecting bit flips into the input and output matrices of a matrix-matrix multiply application, revealing the vulnerability of data structures, fault masking, and error propagation. xSim is the first simulation-based MPI performance tool that supports the injection of both process failures and bit flip faults.
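The kind of experiment the abstract describes can be illustrated with a minimal, self-contained Python sketch. It is not xSim's interface and it does not reproduce the paper's setup: the function names (flip_bit, matmul, corrupted_cells), the 2x2 matrices, and the choice of bit 52 are all assumptions made for illustration. It flips one bit in the IEEE 754 encoding of a matrix element and compares the product against a fault-free reference, showing a fault that propagates, one that is masked, and one confined to the output.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a double's IEEE 754 encoding."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def matmul(a, b):
    """Plain triple-loop matrix-matrix multiply on lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def corrupted_cells(reference, observed):
    """Return the (row, col) positions where observed differs from reference."""
    return [(i, j)
            for i, row in enumerate(reference)
            for j, ref in enumerate(row)
            if observed[i][j] != ref]


# Hypothetical example data (not taken from the paper).
A = [[1.0, 2.0], [0.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_ref = matmul(A, B)  # fault-free reference result

# Case 1: fault in an input element. Bit 52 is the lowest exponent bit,
# so 1.0 becomes 0.5 and the error propagates to the whole result row.
A_fault = [row[:] for row in A]
A_fault[0][0] = flip_bit(A[0][0], 52)
input_fault = corrupted_cells(C_ref, matmul(A_fault, B))

# Case 2: the same bit flipped in a 0.0 element yields a tiny value that
# is absorbed by rounding in the dot product, so the fault is masked.
A_masked = [row[:] for row in A]
A_masked[1][0] = flip_bit(A[1][0], 52)
masked_fault = corrupted_cells(C_ref, matmul(A_masked, B))

# Case 3: fault in an output element after the multiply. Only that one
# element is wrong; nothing propagates within this computation.
C_fault = [row[:] for row in C_ref]
C_fault[1][1] = flip_bit(C_ref[1][1], 52)
output_fault = corrupted_cells(C_ref, C_fault)

print("input fault corrupts:", input_fault)
print("masked fault corrupts:", masked_fault)
print("output fault corrupts:", output_fault)
```

In this sketch the injection location is the matrix element and bit index, and the activation time is implied by whether the flip happens before or after the multiply. A simulator-based injector such as the one described would control both from outside the application rather than by editing its data in place.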
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1241477
Country of Publication:
United States
Language:
English

Similar Records

Supporting the Development of Resilient Message Passing Applications using Simulation
Conference · Tue Dec 31 23:00:00 EST 2013 · OSTI ID:1131524

A new deadlock resolution protocol and message matching algorithm for the extreme-scale simulator
Journal Article · Mon Mar 21 20:00:00 EDT 2016 · Concurrency and Computation. Practice and Experience · OSTI ID:1286913
