Node failure resiliency for Uintah without checkpointing

Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John

doi:10.1002/cpe.5340


Journal Article · June 2, 2019 · Concurrency and Computation. Practice and Experience

DOI: https://doi.org/10.1002/cpe.5340 · OSTI ID: 1637354

Sahasrabudhe, Damodar ^[1]; Berzins, Martin ^[2]; Schmidt, John ^[2]

[1] Univ. of Utah, Salt Lake City, UT (United States); University of Utah
[2] Univ. of Utah, Salt Lake City, UT (United States)

The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework, both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by combining two techniques: (1) a fault-tolerant MPI implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data within Uintah without checkpointing. Experiments were carried out using a fault-tolerant MPI implementation (ULFM) to recover from runtime failures and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and is not subject to the overshoots found in other interpolation methods.
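The boundedness problem mentioned in the abstract can be illustrated with a small, hedged sketch. The code below is not the LENO scheme from the paper; as a stand-in it uses a simple minmod-limited slope in 1D with an assumed refinement ratio of 2, and the function names (`minmod`, `reconstruct_fine`) are hypothetical. It rebuilds fine-level cell values from coarse cell averages and compares an unlimited reconstruction, which overshoots near a jump, with a limited one, which stays within the range of the coarse data.

```python
def minmod(a, b):
    """Return the smaller-magnitude argument if the signs agree, else 0."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def reconstruct_fine(coarse, limited=True):
    """Rebuild fine cell values (assumed refinement ratio 2) from coarse averages.

    Each coarse cell gets a linear profile; the two fine values are sampled
    at the fine cell centres, so their mean equals the coarse average.
    """
    n = len(coarse)
    fine = []
    for i, u in enumerate(coarse):
        # At the domain ends, reuse the cell's own value as the missing neighbour.
        left = coarse[i - 1] if i > 0 else u
        right = coarse[i + 1] if i < n - 1 else u
        if limited:
            slope = minmod(u - left, right - u)   # limited: no new extrema
        else:
            slope = 0.5 * (right - left)          # unlimited central slope
        fine.append(u - 0.25 * slope)
        fine.append(u + 0.25 * slope)
    return fine


# A non-negative, bounded coarse field with a sharp front.
coarse = [0.0, 0.0, 1.0, 1.0]
unlimited = reconstruct_fine(coarse, limited=False)
limited = reconstruct_fine(coarse, limited=True)

print("unlimited range:", min(unlimited), max(unlimited))
print("limited range:  ", min(limited), max(limited))
```

On this data the unlimited reconstruction produces values below 0 and above 1, while the limited one stays inside [0, 1]. A minmod limiter achieves this by dropping to a constant profile near the jump, which is the accuracy cost that higher-order limited schemes aim to avoid.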


Research Organization: Univ. of Utah, Salt Lake City, UT (United States)

Sponsoring Organization: National Science Foundation; USDOE National Nuclear Security Administration (NNSA)

Grant/Contract Number: NA0002375

OSTI ID: 1637354

Journal Information: Concurrency and Computation. Practice and Experience, Vol. 31, Issue 20; ISSN 1532-0626

Publisher: Wiley

Country of Publication: United States

Language: English
