OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Concepts for OpenMP Target Offload Resilience

Abstract

Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the general-purpose computing graphics processing unit (GPGPU) errors and failures experienced on Titan. It further proposes improvements to OpenMP, including a preliminary prototype design, that support resilient offload to devices so that errors and failures in heterogeneous high-performance computing (HPC) systems can be handled efficiently.

Authors:
Engelmann, Christian [1]; Vallee, Geoffroy [1]; Pophale, Swaroop [1]
  1. ORNL
Publication Date:
August 2019
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1570122
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Journal Volume: 11718; Conference: 15th International Workshop on OpenMP (IWOMP 2019), Auckland, New Zealand, 9/11/2019-9/13/2019
Country of Publication:
United States
Language:
English

Citation Formats

Engelmann, Christian, Vallee, Geoffroy, and Pophale, Swaroop. Concepts for OpenMP Target Offload Resilience. United States: N. p., 2019. Web. doi:10.1007/978-3-030-28596-8_6.
Engelmann, Christian, Vallee, Geoffroy, & Pophale, Swaroop. Concepts for OpenMP Target Offload Resilience. United States. doi:10.1007/978-3-030-28596-8_6.
Engelmann, Christian, Vallee, Geoffroy, and Pophale, Swaroop. "Concepts for OpenMP Target Offload Resilience". United States. doi:10.1007/978-3-030-28596-8_6. https://www.osti.gov/servlets/purl/1570122.
@article{osti_1570122,
title = {Concepts for OpenMP Target Offload Resilience},
author = {Engelmann, Christian and Vallee, Geoffroy and Pophale, Swaroop},
abstractNote = {Recent reliability issues with one of the fastest supercomputers in the world, Titan at Oak Ridge National Laboratory (ORNL), demonstrated the need for resilience in large-scale heterogeneous computing. OpenMP currently does not address error and failure behavior. This paper takes a first step toward resilience for heterogeneous systems by providing the concepts for resilient OpenMP offload to devices. Using real-world error and failure observations, the paper describes the concepts and terminology for resilient OpenMP target offload, including error and failure classes and resilience strategies. It details the experienced general-purpose computing graphics processing unit (GPGPU) errors and failures in Titan. It further proposes improvements in OpenMP, including a preliminary prototype design, to support resilient offload to devices for efficient handling of errors and failures in heterogeneous high-performance computing (HPC) systems.},
doi = {10.1007/978-3-030-28596-8_6},
issn = {0302-9743},
volume = {11718},
place = {United States},
year = {2019},
month = {8}
}

