OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

Abstract

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize it for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and "partially materialize" them efficiently makes ambitious forward recovery based on "data slices" across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, a preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort required to add resilience. The required changes are small (< 2% of lines of code (LOC)), localized, and machine-independent, and, perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads below 2% are generally achieved. This suggests that GVR can be deployed in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery and for several different forward-recovery schemes. GVR's multi-versioning enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerating growing error rates in future extreme-scale systems.
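The abstract's core mechanism (create an array, name successive immutable versions, and materialize slices of older versions for rollback or forward recovery) can be made concrete with a toy. The following self-contained C sketch is illustrative only: the varray type and the va_* functions are invented for this example and are not the GVR API, and a real GVR array is distributed across nodes rather than held in one process's memory.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_VERSIONS 16

    /* Toy versioned array: one process, whole-array snapshots.
     * All names here are hypothetical, NOT the GVR interface. */
    typedef struct {
        size_t  n;                      /* elements per version            */
        int     nver;                   /* number of named versions so far */
        double *snap[MAX_VERSIONS];     /* immutable version snapshots     */
        double *cur;                    /* current, mutable contents       */
    } varray;

    varray *va_create(size_t n) {
        varray *a = calloc(1, sizeof *a);
        a->n = n;
        a->cur = calloc(n, sizeof *a->cur);
        return a;
    }

    /* Name a new version: freeze an immutable copy of the current data. */
    int va_version_inc(varray *a) {
        if (a->nver == MAX_VERSIONS) return -1;
        double *s = malloc(a->n * sizeof *s);
        memcpy(s, a->cur, a->n * sizeof *s);
        a->snap[a->nver] = s;
        return a->nver++;               /* returns the new version's id */
    }

    /* "Partially materialize" a slice [lo, hi) of an older version. */
    void va_get_version(const varray *a, int ver, size_t lo, size_t hi,
                        double *buf) {
        memcpy(buf, a->snap[ver] + lo, (hi - lo) * sizeof *buf);
    }

    int main(void) {
        varray *a = va_create(4);
        a->cur[0] = 1.0;
        int v0 = va_version_inc(a);     /* version 0 holds {1, 0, 0, 0} */

        a->cur[0] = 99.0;               /* simulate latent data corruption */

        /* Rollback recovery: restore only the corrupted slice from the
         * last good version (forward recovery would instead combine
         * slices from several versions or data structures). */
        double slice;
        va_get_version(a, v0, 0, 1, &slice);
        a->cur[0] = slice;
        printf("recovered value: %g\n", a->cur[0]);
        return 0;
    }

Under this pattern, a solver would name a new version after each verified iteration; when an application-level check later detects corruption, it restores only the affected slice from the newest version that still passes the check, which is what lets recovery remain cheap even when detection latency is large.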

Authors:
 Chien, Andrew A. [1]; Balaji, Pavan [2]; Dun, Nan [1]; Fang, Aiman [3]; Fujita, Hajime [1]; Iskra, Kamil [2]; Rubenstein, Zachary [3]; Zheng, Ziming [4]; Hammond, Jeff [5]; Laguna, Ignacio [6]; Richards, David F. [6]; Dubey, Anshu [7]; van Straalen, Brian [7]; Hoemmen, Mark Frederick [8]; Heroux, Michael A. [8]; Teranishi, Keita [8]; Siegel, Andrew R. [2]
  1. Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
  2. Argonne National Lab. (ANL), Argonne, IL (United States)
  3. Univ. of Chicago, Chicago, IL (United States)
  4. HP Vertica, Cambridge, MA (United States)
  5. Intel Corp., Santa Clara, CA (United States)
  6. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  7. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  8. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
September 2016
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1333611
Alternate Identifier(s):
OSTI ID: 1440004; OSTI ID: 1466280
Report Number(s):
SAND-2016-7908J
Journal ID: ISSN 1094-3420; 646619
Grant/Contract Number:  
AC04-94AL85000; AC02-05CH11231; SC0008603; AC02-06CH11357
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; resilience; fault-tolerance; exascale; scalable computing; application-based fault tolerance

Citation Formats

Chien, Andrew A., Balaji, Pavan, Dun, Nan, Fang, Aiman, Fujita, Hajime, Iskra, Kamil, Rubenstein, Zachary, Zheng, Ziming, Hammond, Jeff, Laguna, Ignacio, Richards, David F., Dubey, Anshu, van Straalen, Brian, Hoemmen, Mark Frederick, Heroux, Michael A., Teranishi, Keita, and Siegel, Andrew R. Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience. United States: N. p., 2016. Web. doi:10.1177/1094342016664796.
Chien, Andrew A., Balaji, Pavan, Dun, Nan, Fang, Aiman, Fujita, Hajime, Iskra, Kamil, Rubenstein, Zachary, Zheng, Ziming, Hammond, Jeff, Laguna, Ignacio, Richards, David F., Dubey, Anshu, van Straalen, Brian, Hoemmen, Mark Frederick, Heroux, Michael A., Teranishi, Keita, & Siegel, Andrew R. Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience. United States. https://doi.org/10.1177/1094342016664796
Chien, Andrew A., Balaji, Pavan, Dun, Nan, Fang, Aiman, Fujita, Hajime, Iskra, Kamil, Rubenstein, Zachary, Zheng, Ziming, Hammond, Jeff, Laguna, Ignacio, Richards, David F., Dubey, Anshu, van Straalen, Brian, Hoemmen, Mark Frederick, Heroux, Michael A., Teranishi, Keita, and Siegel, Andrew R. 2016. "Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience". United States. https://doi.org/10.1177/1094342016664796. https://www.osti.gov/servlets/purl/1333611.
@article{osti_1333611,
title = {Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience},
author = {Chien, Andrew A. and Balaji, Pavan and Dun, Nan and Fang, Aiman and Fujita, Hajime and Iskra, Kamil and Rubenstein, Zachary and Zheng, Ziming and Hammond, Jeff and Laguna, Ignacio and Richards, David F. and Dubey, Anshu and van Straalen, Brian and Hoemmen, Mark Frederick and Heroux, Michael A. and Teranishi, Keita and Siegel, Andrew R.},
abstractNote = {Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward-recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small ( < 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR’s multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.},
doi = {10.1177/1094342016664796},
url = {https://www.osti.gov/biblio/1333611},
journal = {International Journal of High Performance Computing Applications},
issn = {1094-3420},
place = {United States},
year = {2016},
month = {9}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 1 work (citation information provided by Web of Science)

Figures / Tables:

Figure 1: Distributed Arrays and Versioning in GVR


Works referenced in this record:

Fault tolerance using lower fidelity data in adaptive mesh applications
conference, January 2013


The Linux implementation of a log-structured file system
journal, July 2006


Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984


Adaptive mesh refinement for hyperbolic partial differential equations
journal, March 1984


EnerJ: approximate data types for safe and general low-power computation
conference, January 2011

  • Sampson, Adrian; Dietl, Werner; Fortuna, Emily
  • Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation - PLDI '11
  • https://doi.org/10.1145/1993498.1993518

The future of microprocessors
journal, May 2011


Local adaptive mesh refinement for shock hydrodynamics
journal, May 1989


Initial MCNP6 Release Overview
journal, December 2012


Fail-stop processors: an approach to designing fault-tolerant computing systems
journal, August 1983


The Use of Triple-Modular Redundancy to Improve Computer Reliability
journal, April 1962


The OpenMC Monte Carlo particle transport code
journal, January 2013


Simulating solidification in metals at high pressure: The drive to petascale computing
journal, September 2006


Chisel: reliability- and accuracy-aware optimization of approximate computational kernels
conference, January 2014

  • Misailovic, Sasa; Carbin, Michael; Achour, Sara
  • Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications - OOPSLA '14
  • https://doi.org/10.1145/2660193.2660231

A first order approximation to the optimum checkpoint interval
journal, September 1974


X10: an object-oriented approach to non-uniform cluster computing
conference, January 2005

  • Charles, Philippe; Grothoff, Christian; Saraswat, Vijay
  • Proceedings of the 20th annual ACM SIGPLAN conference on Object oriented programming systems languages and applications - OOPSLA '05
  • https://doi.org/10.1145/1094811.1094852

Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
conference, January 2013


Reliability Issues in Computing System Design
journal, June 1978


The University of Florida Sparse Matrix Collection
journal, November 2011


Challenges and Prospects for Whole-Core Monte Carlo Analysis
journal, March 2012


FTI: high performance fault tolerance interface for hybrid systems
conference, January 2011

  • Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
  • https://doi.org/10.1145/2063384.2063427

Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
journal, May 2006


An evaluation of difference and threshold techniques for efficient checkpoints
conference, June 2012

  • Hogan, Sean; Hammond, Jeff R.; Chien, Andrew A.
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
  • https://doi.org/10.1109/DSNW.2012.6264674

A higher order estimate of the optimum checkpoint interval for restart dumps
journal, February 2006


When is multi-version checkpointing needed?
conference, January 2013


A Flexible Inner-Outer Preconditioned GMRES Algorithm
journal, March 1993


Preventive Migration vs. Preventive Checkpointing for Extreme Scale Supercomputers
journal, June 2011


Evaluating the viability of process replication reliability for exascale systems
conference, January 2011

  • Ferreira, Kurt; Stearley, Jon; Laros, James H.
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
  • https://doi.org/10.1145/2063384.2063443

Design of ion-implanted MOSFET's with very small physical dimensions
journal, October 1974


Dark silicon and the end of multicore scaling
conference, January 2011


Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006


An overview of the Trilinos project
journal, September 2005


ISABELA for effective in situ compression of scientific data
journal, July 2012

  • Lakshminarasimhan, Sriram; Shah, Neil; Ethier, Stephane
  • Concurrency and Computation: Practice and Experience, Vol. 25, Issue 4
  • https://doi.org/10.1002/cpe.2887

Multidimensional upwind methods for hyperbolic conservation laws
journal, March 1990


Verifying quantitative reliability for programs that execute on unreliable hardware
conference, January 2013

  • Carbin, Michael; Misailovic, Sasa; Rinard, Martin C.
  • Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications - OOPSLA '13
  • https://doi.org/10.1145/2509136.2509546

Addressing failures in exascale computing
journal, March 2014


Toward Exascale Resilience
journal, September 2009


The effect of load imbalances on the performance of Monte Carlo algorithms in LWR analysis
journal, February 2013


Parallel Programmability and the Chapel Language
journal, August 2007


BTRFS: The Linux B-Tree Filesystem
journal, August 2013


Co-array Fortran for parallel programming
journal, August 1998


Data decomposition of Monte Carlo particle transport simulations via tally servers
journal, November 2013


On the Combination of Silent Error Detection and Checkpointing
conference, December 2013


Works referencing / citing this record:

Application health monitoring for extreme-scale resiliency using cooperative fault management
journal, July 2019


Node failure resiliency for Uintah without checkpointing
journal, June 2019

  • Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John
  • Concurrency and Computation: Practice and Experience, Vol. 31, Issue 20
  • https://doi.org/10.1002/cpe.5340