Title: Rolex: Resilience-oriented language extensions for extreme-scale systems

Journal Article · Journal of Supercomputing
 [1];  [1]
  1. Univ. of Southern California, Los Angeles, CA (United States)

Future exascale high-performance computing (HPC) systems will be built from VLSI devices that are less reliable than those used today, and faults will become the norm rather than the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumes correct behavior by the underlying computing system. The mean time to failure (MTTF) of a system is inversely proportional to its number of components, so faults and the resulting system-level failures will become more frequent as systems scale up in processor cores and memory modules. However, not every detected error need cause a catastrophic failure. Many HPC applications are inherently fault resilient, yet the application programmers who have this knowledge lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Our experiments show that an approach that leverages the programmer's insight into the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
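
The inverse relationship between MTTF and component count mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes independent components with identical, exponentially distributed failure times, where any single component failure counts as a system failure. The function name `system_mttf_hours` and the five-year per-component MTTF are hypothetical values chosen only for illustration.

```python
def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    """Estimate system MTTF from a per-component MTTF.

    Assumption (not from the paper): components fail independently with
    identical exponential failure distributions, and the first component
    failure is a system failure. Failure rates then add, so the system
    MTTF is the component MTTF divided by the component count.
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    if component_mttf_hours <= 0:
        raise ValueError("component_mttf_hours must be positive")
    return component_mttf_hours / n_components


if __name__ == "__main__":
    # Hypothetical per-component MTTF of five years, expressed in hours.
    component_mttf = 5 * 365 * 24
    for n in (1_000, 100_000, 1_000_000):
        hours = system_mttf_hours(component_mttf, n)
        print(f"{n:>9} components: system MTTF ~ {hours:.3f} hours")
```

Under these simplifying assumptions, a tenfold increase in component count cuts the expected time between failures by a factor of ten, which is the scaling pressure the abstract describes.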

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Laboratory Directed Research and Development (LDRD) Program
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1259429
Journal Information:
Journal of Supercomputing, Journal Name: Journal of Supercomputing; ISSN 0920-8542
Publisher:
Springer
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 10 works
Citation information provided by
Web of Science
