U.S. Department of Energy
Office of Scientific and Technical Information

Node failure resiliency for Uintah without checkpointing

Journal Article · Concurrency and Computation. Practice and Experience
DOI: https://doi.org/10.1002/cpe.5340 · OSTI ID: 1637354
 [1];  [2];  [2]
  1. Univ. of Utah, Salt Lake City, UT (United States)
  2. Univ. of Utah, Salt Lake City, UT (United States)

The frequency of failures in upcoming exascale supercomputers may well be greater than at present because of many-core architectures, if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint/restart, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework, both at the software level within Uintah and in the data reconstruction method used to recover lost data. This reconstruction has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and physical properties of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by combining two techniques: (1) a fault-tolerant MPI implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing in Uintah. Experiments were carried out using ULFM, a fault-tolerant MPI implementation, to recover from runtime failures, and LENO to recover data on patches belonging to failed ranks, while the simulation continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and is not subject to the overshoots found in other interpolation methods.
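The interpolation problem described in the abstract can be illustrated with a small sketch. The code below is not the paper's LENO scheme and is not taken from Uintah. It is a simplified stand-in that assumes a 1D node-centered field, a refinement ratio of 2, and a standard 4-point cubic Lagrange midpoint formula, with a simple clamp to the two neighboring coarse values acting as the limiter. The function name `refine_midpoints` and its `limit` parameter are hypothetical and exist only for this illustration.

```python
def refine_midpoints(coarse, limit=True):
    """Rebuild a 2x finer node-centered 1D field from coarse node values.

    Illustrative stand-in only: cubic midpoint interpolation with an
    optional clamp. This is NOT the LENO scheme from the paper.
    """
    n = len(coarse)
    fine = []
    for i in range(n - 1):
        left, right = coarse[i], coarse[i + 1]
        if 0 < i < n - 2:
            # 4-point cubic Lagrange interpolation evaluated at the midpoint.
            mid = (-coarse[i - 1] + 9.0 * left + 9.0 * right
                   - coarse[i + 2]) / 16.0
        else:
            # Not enough neighbors near the ends: fall back to linear.
            mid = 0.5 * (left + right)
        if limit:
            # Keep the new value between its two coarse neighbors, so no
            # new extrema (overshoots or negative values) are created.
            mid = min(max(mid, min(left, right)), max(left, right))
        fine.append(left)  # fine node that coincides with a coarse node
        fine.append(mid)   # new fine node halfway between coarse nodes
    fine.append(coarse[-1])
    return fine


# A step profile standing in for a bounded quantity in [0, 1].
coarse = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
unlimited = refine_midpoints(coarse, limit=False)
limited = refine_midpoints(coarse, limit=True)
print("unlimited:", unlimited)
print("limited:  ", limited)
```

On this step profile the unlimited cubic produces -0.0625 just before the jump and 1.0625 just after it, which violates both positivity and the upper bound, while the clamped version stays within [0, 1]. On smooth, monotone data the clamp rarely triggers, so the higher-order values are mostly retained.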

Research Organization:
Univ. of Utah, Salt Lake City, UT (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation
Grant/Contract Number:
NA0002375
OSTI ID:
1637354
Journal Information:
Concurrency and Computation. Practice and Experience, Vol. 31, Issue 20; ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
