U.S. Department of Energy
Office of Scientific and Technical Information

Versioned distributed arrays for resilience in scientific applications: Global view resilience

Journal Article · Procedia Computer Science
  1. Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Univ. of Chicago, Chicago, IL (United States)
  4. Intel Corp. Santa Clara, CA (United States)
  5. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  6. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  7. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  8. Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Intel Corp. Santa Clara, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort required to add resilience. The required changes are small (less than 2% of lines of code), localized, and machine-independent, and they require no changes to software architecture. We also measure the overhead of adding GVR versioning and show that overheads are generally below 2%. We conclude that GVR’s interfaces and implementation are flexible and portable, and that they create a gentle-slope path to tolerating growing error rates in future systems.
Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
Grant/Contract Number:
AC02-05CH11231; AC02-06CH11357; AC04-94AL85000; SC0008603
OSTI ID:
1196863
Alternate ID(s):
OSTI ID: 1214658
OSTI ID: 1528959
Journal Information:
Journal Name: Procedia Computer Science; Journal Volume: 51; Journal Issue: C; ISSN 1877-0509
Publisher:
Elsevier
Country of Publication:
United States
Language:
English

