U.S. Department of Energy
Office of Scientific and Technical Information

Software Resilience using Kokkos Ecosystem

Technical Report · DOI:https://doi.org/10.2172/1762089 · OSTI ID:1762089
 [1];  [1];  [1];  [1]
  1. Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)

Because hardware failures are costly in mission-critical and scientific applications, software must provide a mechanism to prevent or recover from interruptions. The Kokkos ecosystem is a programming environment that provides performance and portability to many applications that run on DOE supercomputers as well as smaller-scale systems. These applications require a higher level of service because of the cost associated with each simulation or the critical nature of the mission. Software resilience enables an application to manage hardware failures, reducing the cost of an interruption. Two resilience methodologies have been added to the Kokkos ecosystem: checkpointing, which provides restart capabilities, and a resilient execution model, which accounts for failures in compute devices. The design and implementation of each addition are described, and appropriate examples are included for end users.

Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories, Livermore, CA (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract Number:
AC04-94AL85000; NA0003525
OSTI ID:
1762089
Report Number(s):
SAND--2019-3616; 674319
Country of Publication:
United States
Language:
English

Similar Records

Implementing Software Resiliency in HPX for Extreme Scale Computing
Technical Report · Wed Apr 15 00:00:00 EDT 2020 · OSTI ID:1614897

The Kokkos Ecosystem [Brief]
Technical Report · Sat Aug 01 00:00:00 EDT 2020 · OSTI ID:1656942

Toward Resilient Heterogeneous Computing Workflow through Kokkos-DataSpaces Integration
Technical Report · Mon Nov 30 23:00:00 EST 2020 · OSTI ID:1738875

Related Subjects