OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

Journal Article · International Journal of High Performance Computing Applications
  1. Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Univ. of Chicago, Chicago, IL (United States)
  4. HP Vertica, Cambridge, MA (United States)
  5. Intel Corp., Santa Clara, CA (United States)
  6. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  7. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  8. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize it for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, a preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized, and machine-independent, and, perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery and several different forward-recovery schemes. GVR’s multi-version capability enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Lastly, our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.
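
The multi-version rollback idea in the abstract can be illustrated with a toy, single-process sketch. This is not the GVR API: the class VersionedArray and its methods snapshot, read_slice, and rollback_to_last_good are hypothetical names invented for illustration, and a plain Python list stands in for a distributed array. The sketch assumes a silent corruption that is detected only after later versions have been taken, so recovery has to skip back past more than one version.

```python
import math


class VersionedArray:
    """Toy in-memory stand-in for a versioned array (hypothetical, not GVR)."""

    def __init__(self, size, fill=0.0):
        self.data = [fill] * size
        self._versions = {}  # version name -> copy of the data at that point
        self._order = []     # version names, oldest first

    def snapshot(self, name):
        """Store a copy of the current contents under a caller-chosen name."""
        if name in self._versions:
            raise ValueError(f"version {name!r} already exists")
        self._versions[name] = list(self.data)
        self._order.append(name)

    def read_slice(self, name, start, stop):
        """Read part of a named version without restoring the whole array."""
        return self._versions[name][start:stop]

    def rollback_to_last_good(self, is_valid):
        """Restore the newest version accepted by is_valid and return its name."""
        for name in reversed(self._order):
            candidate = self._versions[name]
            if is_valid(candidate):
                self.data = list(candidate)
                return name
        raise RuntimeError("no valid version available")


def no_nan(values):
    """Validity check used here: a version is good if it contains no NaN."""
    return not any(math.isnan(v) for v in values)


arr = VersionedArray(4)
for step in range(1, 4):
    arr.data = [x + 1.0 for x in arr.data]  # stand-in for one compute step
    if step == 2:
        arr.data[1] = float("nan")          # injected latent error, not yet detected
    arr.snapshot(f"step{step}")

# The error is noticed only after step 3, so versions step3 and step2 are both bad.
restored = arr.rollback_to_last_good(no_nan)
# Partial read: an uncorrupted slice of a newer version is still usable.
tail_of_step2 = arr.read_slice("step2", 2, 4)
```

Keeping several named versions is what lets the rollback skip the two corrupted snapshots; a single most-recent checkpoint would have restored the corrupted state. The slice read hints at how data from a newer version might be combined with an older one during forward recovery, although the actual schemes evaluated in the article are not reproduced here.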

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC04-94AL85000; AC52-07NA27344; AC02-05CH11231; SC0008603; AC02-06CH11357
OSTI ID:
1333611
Alternate ID(s):
OSTI ID: 1440004; OSTI ID: 1466280; OSTI ID: 1811742
Report Number(s):
SAND-2016-7908J; LLNL-JRNL-822995; 646619
Journal Information:
International Journal of High Performance Computing Applications; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 5 works


