Software Resilience using Kokkos Ecosystem
- Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
Due to the cost of hardware failures within mission-critical and scientific applications, software must provide a mechanism to prevent or recover from interruptions. The Kokkos ecosystem is a programming environment that provides performance and portability to many applications that run on DOE supercomputers as well as smaller-scale systems. These applications require a higher level of service due to the cost associated with each simulation or the critical nature of the mission. Software resilience enables an application to manage hardware failures, reducing the cost of an interruption. Two different resilience methodologies have been added to the Kokkos ecosystem: checkpointing has been added for restart capabilities, and a resilient execution model has been added to account for failures in compute devices. The design and implementation of each of these additions are described, and appropriate examples are included for end users.
- Research Organization: Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories, Livermore, CA (United States)
- Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
- DOE Contract Number: AC04-94AL85000; NA0003525
- OSTI ID: 1762089
- Report Number(s): SAND--2019-3616; 674319
- Country of Publication: United States
- Language: English