Rolex: Resilience-oriented language extensions for extreme-scale systems
- Univ. of Southern California, Los Angeles, CA (United States)
Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that are less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of a system scales inversely with the number of components in it, so faults and the resulting system-level failures will become more frequent as systems scale in the number of processor cores and memory modules. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient, but the application programmers who have this knowledge lack mechanisms to convey it to the system. In this paper, we present new Resilience-Oriented Language Extensions (Rolex) that facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
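The inverse scaling of MTTF with component count can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes independent, identical components with constant failure rates and a system that fails when any one component fails, and the function name and the numbers are hypothetical.

```python
def system_mttf_hours(component_mttf_hours: float, num_components: int) -> float:
    """Estimate system MTTF under a simple series model.

    Assumption (not from the paper): components fail independently with
    the same constant failure rate, and any single component failure
    counts as a system failure. Failure rates then add, so the system
    MTTF is the component MTTF divided by the number of components.
    """
    if num_components < 1:
        raise ValueError("num_components must be at least 1")
    return component_mttf_hours / num_components


if __name__ == "__main__":
    # Illustrative numbers only: a 5-year component MTTF (5 * 8760 hours).
    component_mttf = 5 * 8760.0
    for n in (1_000, 100_000, 1_000_000):
        print(n, "components ->", system_mttf_hours(component_mttf, n), "hours")
```

Under these assumptions, each tenfold increase in component count cuts the expected time between failures by a factor of ten, which is the scaling pressure the abstract describes.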
- Research Organization:
- Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE Laboratory Directed Research and Development (LDRD) Program
- Grant/Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1259429
- Journal Information:
- Journal Name: Journal of Supercomputing; ISSN 0920-8542
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English