Rolex: Resilience-oriented language extensions for extreme-scale systems

Lucas, Robert F.; Hukerikar, Saurabh

doi:10.1007/s11227-016-1752-5

Journal Article · May 26, 2016 · Journal of Supercomputing

DOI: https://doi.org/10.1007/s11227-016-1752-5 · OSTI ID: 1259429

Lucas, Robert F. [1]; Hukerikar, Saurabh [1]

[1] Univ. of Southern California, Los Angeles, CA (United States)

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that are less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumes correct behavior by the underlying computing system. The mean time to failure (MTTF) of a system scales inversely with its number of components, so faults and the resulting system-level failures will become more frequent as systems scale in the number of processor cores and memory modules. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient, but the application programmers who have this knowledge lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE Laboratory Directed Research and Development (LDRD) Program

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1259429

Journal Information: Journal Name: Journal of Supercomputing; ISSN 0920-8542

Publisher: Springer

Country of Publication: United States

Language: English

Citation Metrics:

Cited by: 10 works

Citation information provided by Web of Science

Similar Records

Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. Final report

Technical Report · April 16, 2020 · OSTI ID: 1259429

Kramer, William; Jha, Saurabh; Brandt, James; +1 more

PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems

Conference · December 1, 2020 · OSTI ID: 1259429

Hukerikar, Saurabh; Engelmann, Christian

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)

Technical Report · December 16, 2022 · OSTI ID: 1259429

Engelmann, Christian; Ashraf, Rizwan; Hukerikar, Saurabh; +2 more

Related Subjects

97 MATHEMATICS AND COMPUTING
resilience
high-performance computing
exascale computing
programming models
runtime systems
fault tolerance
