DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Rolex: Resilience-oriented language extensions for extreme-scale systems

Abstract

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half-a-century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of the system scales inversely to the number of components in the system and therefore faults and resultant system level failures will increase, as systems scale in terms of the number of processor cores and memory modules used. However every error detected need not cause catastrophic failure. Many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge but lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex) which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
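
The inverse scaling of MTTF mentioned in the abstract can be illustrated with a short worked calculation. This is only a sketch under assumptions that the record itself does not state: the system has N identical components, their failures are independent and exponentially distributed, and any single component failure counts as a system failure. The component MTTF and the component count below are hypothetical values chosen for illustration, not figures from the paper.

```latex
% Assumed model: N identical components in series, each failing
% independently at a constant rate \lambda = 1 / MTTF_component.
\[
  \lambda_{\text{system}} = N \lambda
  \quad\Longrightarrow\quad
  \mathrm{MTTF}_{\text{system}}
    = \frac{1}{N \lambda}
    = \frac{\mathrm{MTTF}_{\text{component}}}{N}
\]

% Hypothetical numbers: component MTTF of 5 years (43,800 hours)
% and N = 100,000 components.
\[
  \mathrm{MTTF}_{\text{system}}
    = \frac{43\,800\ \text{h}}{100\,000}
    = 0.438\ \text{h}
    \approx 26\ \text{minutes}
\]
```

Under these assumed numbers, a system built from individually reliable parts would still see a failure roughly every half hour, which is the kind of pressure the abstract describes as systems scale.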

Authors:
Lucas, Robert F. [1]; Hukerikar, Saurabh [1]
  1. Univ. of Southern California, Los Angeles, CA (United States)
Publication Date:
2016-05-26
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1259429
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Supercomputing
Additional Journal Information:
Journal ID: ISSN 0920-8542
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; resilience; high-performance computing; exascale computing; programming models; runtime systems; fault tolerance

Citation Formats

Lucas, Robert F., and Hukerikar, Saurabh. Rolex: Resilience-oriented language extensions for extreme-scale systems. United States: N. p., 2016. Web. doi:10.1007/s11227-016-1752-5.
Lucas, Robert F., & Hukerikar, Saurabh. Rolex: Resilience-oriented language extensions for extreme-scale systems. United States. https://doi.org/10.1007/s11227-016-1752-5
Lucas, Robert F., and Hukerikar, Saurabh. 2016. "Rolex: Resilience-oriented language extensions for extreme-scale systems". United States. https://doi.org/10.1007/s11227-016-1752-5. https://www.osti.gov/servlets/purl/1259429.
@article{osti_1259429,
title = {Rolex: Resilience-oriented language extensions for extreme-scale systems},
author = {Lucas, Robert F. and Hukerikar, Saurabh},
abstractNote = {Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half-a-century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of the system scales inversely to the number of components in the system and therefore faults and resultant system level failures will increase, as systems scale in terms of the number of processor cores and memory modules used. However every error detected need not cause catastrophic failure. Many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge but lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex) which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.},
doi = {10.1007/s11227-016-1752-5},
journal = {Journal of Supercomputing},
place = {United States},
year = {2016},
month = {5}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 10 works
Citation information provided by
Web of Science

Works referenced in this record:

The International Exascale Software Project roadmap
journal, January 2011

  • Dongarra, Jack; Beckman, Pete; Moore, Terry
  • The International Journal of High Performance Computing Applications, Vol. 25, Issue 1
  • DOI: 10.1177/1094342010391989

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

Static analysis and compiler design for idempotent processing
conference, January 2012

  • de Kruijf, Marc A.; Sankaralingam, Karthikeyan; Jha, Somesh
  • Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation - PLDI '12
  • DOI: 10.1145/2254064.2254120

A Case for Soft Error Detection and Correction in Computational Chemistry
journal, August 2013

  • van Dam, Hubertus J. J.; Vishnu, Abhinav; de Jong, Wibe A.
  • Journal of Chemical Theory and Computation, Vol. 9, Issue 9
  • DOI: 10.1021/ct400489c

Brook for GPUs: stream computing on graphics hardware
journal, August 2004


Self-stabilizing iterative solvers
conference, January 2013

  • Sao, Piyush; Vuduc, Richard
  • Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13
  • DOI: 10.1145/2530268.2530272

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
journal, January 2008

  • Langou, J.; Chen, Z.; Bosilca, G.
  • SIAM Journal on Scientific Computing, Vol. 30, Issue 1
  • DOI: 10.1137/040620394

Co-array Fortran for parallel programming
journal, August 1998


Fault tolerant data structures
conference, January 1996

  • Aumann, Y.; Bender, M. A.
  • Proceedings of 37th Conference on Foundations of Computer Science
  • DOI: 10.1109/SFCS.1996.548517

Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems
conference, November 2012

  • Chung, Jinsuk; Lee, Ikhwan; Sullivan, Michael
  • 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2012.36

A programming model for resilience in extreme scale computing
conference, June 2012

  • Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W)
  • DOI: 10.1109/DSNW.2012.6264671

Robust graph traversal: Resiliency techniques for data intensive supercomputing
conference, September 2013

  • Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
  • 2013 IEEE High Performance Extreme Computing Conference (HPEC)
  • DOI: 10.1109/HPEC.2013.6670340

Enabling application resilience through programming model based fault amelioration
conference, September 2015

  • Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
  • 2015 IEEE High Performance Extreme Computing Conference (HPEC)
  • DOI: 10.1109/HPEC.2015.7322460

Algorithmic approaches to low overhead fault detection for sparse linear algebra
conference, June 2012

  • Sloan, Joseph; Kumar, Rakesh; Bronevetsky, Greg
  • 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • DOI: 10.1109/DSN.2012.6263938

An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance
conference, June 2013

  • Sloan, Joseph; Kumar, Rakesh; Bronevetsky, Greg
  • 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • DOI: 10.1109/DSN.2013.6575309

Synthesis of fault tolerant architectures for molecular dynamics
conference, January 1994

  • Yajnik, S.; Jha, N. K.
  • Proceedings of IEEE International Symposium on Circuits and Systems - ISCAS '94
  • DOI: 10.1109/ISCAS.1994.409243

Static analysis and compiler design for idempotent processing
journal, August 2012

  • de Kruijf, Marc A.; Sankaralingam, Karthikeyan; Jha, Somesh
  • ACM SIGPLAN Notices, Vol. 47, Issue 6
  • DOI: 10.1145/2345156.2254120

Addressing failures in exascale computing
journal, March 2014

  • Snir, Marc; Wisniewski, Robert W.; Abraham, Jacob A.
  • The International Journal of High Performance Computing Applications, Vol. 28, Issue 2
  • DOI: 10.1177/1094342014522573

Brook for GPUs: stream computing on graphics hardware
conference, January 2004


Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems
journal, January 2013

  • Chung, Jinsuk; Lee, Ikhwan; Sullivan, Michael
  • Scientific Programming, Vol. 21, Issue 3-4
  • DOI: 10.1155/2013/473915

Algorithmic Based Fault Tolerance Applied to High Performance Computing
preprint, January 2008