Rolex: Resilience-oriented language extensions for extreme-scale systems

Lucas, Robert F.; Hukerikar, Saurabh

doi:10.1007/s11227-016-1752-5

Abstract

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of the system scales inversely with the number of components in the system, and therefore faults and the resultant system-level failures will increase as systems scale in terms of the number of processor cores and memory modules used. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge but lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
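
The inverse scaling of MTTF with component count mentioned in the abstract can be illustrated with a small sketch. It rests on assumptions that are not stated in this record: components fail independently, each has the same exponentially distributed time to failure, and any single component failure counts as a system failure. The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    """Estimate system MTTF for n identical, independent components.

    Assumes exponential failure times and a series model (the first
    component failure fails the system), so failure rates add up and
    the system MTTF is the component MTTF divided by n.
    """
    if component_mttf_hours <= 0:
        raise ValueError("component_mttf_hours must be positive")
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mttf_hours / n_components


if __name__ == "__main__":
    # Hypothetical component MTTF of one million hours.
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9} components -> {system_mttf_hours(1_000_000.0, n):g} h")
```

Under these assumptions, multiplying the component count by ten divides the expected time between system failures by ten, which is the trend the abstract describes.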

Authors:

Lucas, Robert F. [1]; Hukerikar, Saurabh [1]

[1] Univ. of Southern California, Los Angeles, CA (United States)

Publication Date: May 26, 2016

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Org.: USDOE Laboratory Directed Research and Development (LDRD) Program

OSTI Identifier: 1259429

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: Journal of Supercomputing

Additional Journal Information: Journal ID: ISSN 0920-8542

Publisher: Springer

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING; resilience; high-performance computing; exascale computing; programming models; runtime systems; fault tolerance

Citation Formats


Lucas, Robert F., and Hukerikar, Saurabh. Rolex: Resilience-oriented language extensions for extreme-scale systems. United States: N. p., 2016. Web. doi:10.1007/s11227-016-1752-5.

Lucas, Robert F., & Hukerikar, Saurabh. Rolex: Resilience-oriented language extensions for extreme-scale systems. United States. https://doi.org/10.1007/s11227-016-1752-5

Lucas, Robert F., and Hukerikar, Saurabh. 2016. "Rolex: Resilience-oriented language extensions for extreme-scale systems". United States. https://doi.org/10.1007/s11227-016-1752-5. https://www.osti.gov/servlets/purl/1259429.


                    
@article{osti_1259429,
  title        = {Rolex: Resilience-oriented language extensions for extreme-scale systems},
  author       = {Lucas, Robert F. and Hukerikar, Saurabh},
  abstractNote = {Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that will be less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half-a-century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of the system scales inversely to the number of components in the system and therefore faults and resultant system level failures will increase, as systems scale in terms of the number of processor cores and memory modules used. However every error detected need not cause catastrophic failure. Many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge but lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex) which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Furthermore, our experiments show that an approach that leverages the programmer's insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.},
  doi          = {10.1007/s11227-016-1752-5},
  journal      = {Journal of Supercomputing},
  place        = {United States},
  year         = {2016},
  month        = {May}
}

Journal Article:

Free Publicly Available Full Text

Accepted Manuscript (DOE)

Publisher's Version of Record

https://doi.org/10.1007/s11227-016-1752-5

Citation Metrics:

Cited by: 10 works

Citation information provided by Web of Science

Works referenced in this record:

The International Exascale Software Project roadmap
journal, January 2011

Dongarra, Jack; Beckman, Pete; Moore, Terry
The International Journal of High Performance Computing Applications, Vol. 25, Issue 1
DOI: 10.1177/1094342010391989

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

Bosilca, George; Delmas, Rémi; Dongarra, Jack
Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
DOI: 10.1016/j.jpdc.2008.12.002

Static analysis and compiler design for idempotent processing
conference, January 2012

de Kruijf, Marc A.; Sankaralingam, Karthikeyan; Jha, Somesh
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation - PLDI '12
DOI: 10.1145/2254064.2254120

A Case for Soft Error Detection and Correction in Computational Chemistry
journal, August 2013

van Dam, Hubertus J. J.; Vishnu, Abhinav; de Jong, Wibe A.
Journal of Chemical Theory and Computation, Vol. 9, Issue 9
DOI: 10.1021/ct400489c

Brook for GPUs: stream computing on graphics hardware
journal, August 2004

Buck, Ian; Foley, Tim; Horn, Daniel
ACM Transactions on Graphics, Vol. 23, Issue 3
DOI: 10.1145/1015706.1015800

Self-stabilizing iterative solvers
conference, January 2013

Sao, Piyush; Vuduc, Richard
Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13
DOI: 10.1145/2530268.2530272

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
journal, January 2008

Langou, J.; Chen, Z.; Bosilca, G.
SIAM Journal on Scientific Computing, Vol. 30, Issue 1
DOI: 10.1137/040620394

Co-array Fortran for parallel programming
journal, August 1998

Numrich, Robert W.; Reid, John
ACM SIGPLAN Fortran Forum, Vol. 17, Issue 2
DOI: 10.1145/289918.289920

Fault tolerant data structures
conference, January 1996

Aumann, Y.; Bender, M. A.
Proceedings of 37th Conference on Foundations of Computer Science
DOI: 10.1109/SFCS.1996.548517

Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems
conference, November 2012

Chung, Jinsuk; Lee, Ikhwan; Sullivan, Michael
2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2012.36

A programming model for resilience in extreme scale computing
conference, June 2012

Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
DOI: 10.1109/DSNW.2012.6264671

Robust graph traversal: Resiliency techniques for data intensive supercomputing
conference, September 2013

Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
2013 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC.2013.6670340

Enabling application resilience through programming model based fault amelioration
conference, September 2015

Hukerikar, Saurabh; Diniz, Pedro C.; Lucas, Robert F.
2015 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC.2015.7322460

Algorithmic approaches to low overhead fault detection for sparse linear algebra
conference, June 2012

Sloan, Joseph; Kumar, Rakesh; Bronevetsky, Greg
2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012)
DOI: 10.1109/DSN.2012.6263938

An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance
conference, June 2013

Sloan, Joseph; Kumar, Rakesh; Bronevetsky, Greg
2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
DOI: 10.1109/DSN.2013.6575309

Synthesis of fault tolerant architectures for molecular dynamics
conference, January 1994

Yajnik, S.; Jha, N. K.
Proceedings of IEEE International Symposium on Circuits and Systems - ISCAS '94
DOI: 10.1109/ISCAS.1994.409243

Static analysis and compiler design for idempotent processing
journal, August 2012

de Kruijf, Marc A.; Sankaralingam, Karthikeyan; Jha, Somesh
ACM SIGPLAN Notices, Vol. 47, Issue 6
DOI: 10.1145/2345156.2254120

Addressing failures in exascale computing
journal, March 2014

Snir, Marc; Wisniewski, Robert W.; Abraham, Jacob A.
The International Journal of High Performance Computing Applications, Vol. 28, Issue 2
DOI: 10.1177/1094342014522573

Brook for GPUs: stream computing on graphics hardware
conference, January 2004

Buck, Ian; Foley, Tim; Horn, Daniel
ACM SIGGRAPH 2004 Papers on - SIGGRAPH '04
DOI: 10.1145/1186562.1015800

Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems
journal, January 2013

Chung, Jinsuk; Lee, Ikhwan; Sullivan, Michael
Scientific Programming, Vol. 21, Issue 3-4
DOI: 10.1155/2013/473915

Algorithmic Based Fault Tolerance Applied to High Performance Computing
preprint, January 2008

Bosilca, George; Delmas, Remi; Dongarra, Jack
arXiv
DOI: 10.48550/arxiv.0806.3121

