SciTech Connect

This content will become publicly available on May 26, 2017

Title: Rolex: Resilience-oriented language extensions for extreme-scale systems

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of the system scales inversely with the number of components in the system; therefore, faults and the resulting system-level failures will increase as systems scale in the number of processor cores and memory modules used. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge, and they lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
 [1] ;  [1]
  1. Univ. of Southern California, Los Angeles, CA (United States)
Publication Date:
OSTI Identifier:
Grant/Contract Number:
Accepted Manuscript
Journal Name:
Journal of Supercomputing
Additional Journal Information:
Journal ID: ISSN 0920-8542
Research Org:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
ORNL Program Development; USDOE
Country of Publication:
United States
97 MATHEMATICS AND COMPUTING; resilience; high-performance computing; exascale computing; programming models; runtime systems; fault tolerance