skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: ASC CSSE Level 2 Milestone #6362: Resilient Asynchronous Many Task Programming Model

Technical Report ·
DOI:https://doi.org/10.2172/1469062· OSTI ID:1469062
 [1];  [1];  [1];  [1];  [1];  [1];  [2];  [2];  [3]
  1. Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
  2. Rice University, Houston, TX (United States)
  3. Rice University, Houston, TX (United States); Georgia Institute of Technology, Atlanta, GA (United States)

This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Re- silient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT prograrnming model are assessed through two ap- proaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the perfor- mance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily us- able Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastruc- ture. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).

Research Organization:
Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000; NA0003525
OSTI ID:
1469062
Report Number(s):
SAND-2018-9672; 667589
Country of Publication:
United States
Language:
English