Fault Tolerant Communication Runtime Support for Data-Centric Programming Models
The largest supercomputers in the world today consist of hundreds of thousands of processing cores and many more other hardware components. At such scales, hardware faults are commonplace, which makes fault-resilient software systems a necessity. While different fault-resilience models are available today, most of them focus on allowing the computational processes to survive faults. In contrast, we have recently started investigating fault-resilience techniques for data-centric programming models such as the partitioned global address space (PGAS) models. The primary difference in data-centric models is the decoupling of computation and data locality. That is, data placement is decoupled from the executing processes, allowing us to view process failure (the physical node hosting a process is dead) separately from data failure (the physical node hosting data is dead). In this paper, we take a first step towards data-centric fault resilience by designing and implementing a fault-resilient one-sided communication runtime framework using Global Arrays and its communication system, ARMCI. The framework consists of a fault-resilient process manager, a low-overhead, network-assisted module for detecting remote node faults, non-data-moving collective communication primitives, and failure semantics and error codes for one-sided communication runtime systems. Our performance evaluation indicates that our framework incurs little overhead compared to state-of-the-art designs and provides a fundamental framework of fault resiliency for PGAS models.
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1179151
- Report Number(s): PNNL-SA-73257
- Country of Publication: United States
- Language: English
Similar Records
Designing Energy Efficient Communication Runtime Systems for Data Centric Programming Models
Conference · Fri Dec 17 23:00:00 EST 2010 · OSTI ID:1023206
On the Suitability of MPI as a PGAS Runtime
Conference · Wed Dec 17 23:00:00 EST 2014 · OSTI ID:1194324
Accelerating the Global Arrays ComEx Runtime using Multiple Progress Ranks
Conference · Mon Dec 16 23:00:00 EST 2019 · OSTI ID:1598881