OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Versioned distributed arrays for resilience in scientific applications: Global view resilience

Journal Article · Procedia Computer Science
 [1];  [2];  [2];  [1];  [3];  [1];  [2];  [3];  [4];  [4];  [5];  [5];  [6];  [6];  [7];  [7];  [8];  [8];  [8];  [2]
  1. Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Univ. of Chicago, Chicago, IL (United States)
  4. Intel Corp., Santa Clara, CA (United States)
  5. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  6. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  7. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  8. Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States); Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Intel Corp., Santa Clara, CA (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR’s interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort required to add resilience. The required changes are small (less than 2% of lines of code), localized, and machine-independent, and they require no changes to software architecture. We also measure the overhead of adding GVR versioning and show that overheads are generally below 2%. We conclude that GVR’s interfaces and implementation are flexible and portable, and that they create a gentle-slope path to tolerating growing error rates in future systems.

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
SC0008603; AC02-06CH11357; AC02-05CH11231; AC04-94AL85000
OSTI ID:
1196863
Alternate ID(s):
OSTI ID: 1214658
Journal Information:
Procedia Computer Science, Vol. 51, Issue C; Conference: International Conference On Computational Science (ICCS 2015). Computational Science at the Gates of Nature, Reykjavik (Iceland), 1-3 Jun 2015; ISSN 1877-0509
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 12 works
Citation information provided by Web of Science

Cited By (3)

Failure Recovery in Resilient X10 journal July 2019
Language Support for Reliable Memory Regions preprint January 2016
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction journal February 2021