Versioned distributed arrays for resilience in scientific applications: Global view resilience
- Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
- Argonne National Lab. (ANL), Argonne, IL (United States)
- Univ. of Chicago, Chicago, IL (United States)
- Intel Corp., Santa Clara, CA (United States)
- Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
- Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (less than 2% of lines of code), localized, and machine-independent, and they require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads are generally below 2%. We conclude that GVR’s interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.
- Research Organization:
- Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- Grant/Contract Number:
- SC0008603; AC02-06CH11357; AC02-05CH11231; AC04-94AL85000
- OSTI ID:
- 1196863
- Alternate ID(s):
- OSTI ID: 1214658
- Journal Information:
- Procedia Computer Science, Vol. 51, Issue C; Conference: International Conference On Computational Science (ICCS 2015). Computational Science at the Gates of Nature, Reykjavik (Iceland), 1-3 Jun 2015; ISSN 1877-0509
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English