Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems
Conference · OSTI ID:1107829
- ORNL
xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.
- Research Organization: Oak Ridge National Laboratory (ORNL)
- Sponsoring Organization: ORNL LDRD Director's R&D
- DOE Contract Number: AC05-00OR22725
- OSTI ID: 1107829
- Country of Publication: United States
- Language: English
Similar Records
- Supporting the Development of Resilient Message Passing Applications using Simulation
  Conference · Tue Dec 31 23:00:00 EST 2013 · OSTI ID:1131524
- Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale
  Journal Article · Tue Dec 31 23:00:00 EST 2013 · Future Generation Computer Systems · OSTI ID:1107826
- A Network Contention Model for the Extreme-scale Simulator
  Conference · Wed Dec 31 23:00:00 EST 2014 · OSTI ID:1185871