Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

Chien, Andrew A.; Balaji, Pavan; Dun, Nan; Fang, Aiman; Fujita, Hajime; Iskra, Kamil; Rubenstein, Zachary; Zheng, Ziming; Hammond, Jeff; Laguna, Ignacio; Richards, David F.; Dubey, Anshu; van Straalen, Brian; Hoemmen, Mark Frederick; Heroux, Michael A.; Teranishi, Keita; Siegel, Andrew R.

doi:10.1177/1094342016664796

Journal Article · 8 September 2016 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/1094342016664796 · OSTI ID: 1333611

Chien, Andrew A. ^[1]; Balaji, Pavan ^[2]; Dun, Nan ^[1]; Fang, Aiman ^[3]; Fujita, Hajime ^[1]; Iskra, Kamil ^[2]; Rubenstein, Zachary ^[3]; Zheng, Ziming ^[4]; Hammond, Jeff ^[5]; Laguna, Ignacio ^[6]; Richards, David F. ^[6]; Dubey, Anshu ^[7]; van Straalen, Brian ^[7]; Hoemmen, Mark Frederick ^[8]; Heroux, Michael A. ^[8]; Teranishi, Keita ^[8]; Siegel, Andrew R. ^[2]

[1] Univ. of Chicago, Chicago, IL (United States); Argonne National Lab. (ANL), Argonne, IL (United States)
[2] Argonne National Lab. (ANL), Argonne, IL (United States)
[3] Univ. of Chicago, Chicago, IL (United States)
[4] HP Vertica, Cambridge, MA (United States)
[5] Intel Corp., Santa Clara, CA (United States)
[6] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
[7] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[8] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize it for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, a preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% of lines of code (LOC)), localized, and machine-independent, and, perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery and several different forward-recovery schemes. GVR’s multi-version capability enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Lastly, our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.
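
The multi-version rollback idea in the abstract can be illustrated with a small sketch. This is not the GVR API: the class `VersionedArray` and its methods (`version_inc`, `read_version`, `rollback_to_latest_good`) are hypothetical names invented for this example, and the array is a plain single-process Python list rather than a distributed array. It only shows why keeping several named versions helps when a silent error is detected late: the newest version may already be corrupted, so recovery has to reach further back.

```python
import math


class VersionedArray:
    """Toy, single-process stand-in for a versioned array (hypothetical, not GVR)."""

    def __init__(self, size, max_versions=4):
        self.data = [0.0] * size
        self.max_versions = max_versions
        self._versions = []  # (name, snapshot) pairs, oldest first

    def version_inc(self, name):
        """Snapshot the current contents under an application-chosen name."""
        self._versions.append((name, list(self.data)))
        if len(self._versions) > self.max_versions:
            self._versions.pop(0)  # discard the oldest retained version

    def version_names(self):
        return [name for name, _ in self._versions]

    def read_version(self, name, lo, hi):
        """Read the slice [lo, hi) of a named version without restoring it."""
        for vname, snapshot in self._versions:
            if vname == name:
                return snapshot[lo:hi]
        raise KeyError(name)

    def rollback_to_latest_good(self, is_good):
        """Restore the newest retained version that passes the application's check."""
        for name, snapshot in reversed(self._versions):
            if is_good(snapshot):
                self.data = list(snapshot)
                return name
        raise RuntimeError("no retained version passes the check")


def no_nan(values):
    """Application-level error check (an assumption): data must contain no NaN."""
    return not any(math.isnan(v) for v in values)


arr = VersionedArray(4)
for step in range(1, 4):
    arr.data = [v + 1.0 for v in arr.data]  # one "iteration" of the computation
    if step == 2:
        arr.data[1] = float("nan")  # simulated silent corruption, not yet detected
    arr.version_inc(f"step{step}")

# The error is only noticed after step 3, so step2 and step3 are both corrupted.
restored = arr.rollback_to_latest_good(no_nan)  # falls back to "step1"
```

With a single checkpoint, only the corrupted `step3` state would be available; retaining several versions lets the application skip back to `step1`. The `read_version` call mirrors the abstract's notion of reading a slice of an older version without restoring the whole array.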

Research Organization: Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Argonne National Laboratory (ANL), Argonne, IL (United States); Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)

Sponsoring Organization: USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC04-94AL85000; AC52-07NA27344; AC02-05CH11231; SC0008603; AC02-06CH11357

OSTI ID: 1333611

Alternate ID(s): OSTI ID: 1440004; OSTI ID: 1466280; OSTI ID: 1811742

Report Number(s): SAND-2016-7908J; LLNL-JRNL-822995; 646619

Journal Information: International Journal of High Performance Computing Applications; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Citation Metrics: Cited by 5 works (citation information provided by Web of Science)

Cited By (2)

Agarwal, Pratul K.; Naughton, Thomas; Park, Byung H. Application health monitoring for extreme‐scale resiliency using cooperative fault management. Concurrency and Computation: Practice and Experience, Vol. 32, Issue 2 (journal, July 2019). https://doi.org/10.1002/cpe.5449
Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John. Node failure resiliency for Uintah without checkpointing. Concurrency and Computation: Practice and Experience, Vol. 31, Issue 20 (journal, June 2019). https://doi.org/10.1002/cpe.5340