The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework, both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault-tolerant MPI implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI implementation, User Level Failure Mitigation (ULFM), to recover from runtime failure, and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
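The idea of restoring fine mesh data from a coarse mesh copy without violating boundedness can be illustrated with a small sketch. The code below is not the paper's LENO scheme and does not use Uintah; it substitutes a much simpler minmod-limited linear reconstruction on a 1D mesh with a refinement ratio of 2, and the function names `minmod` and `refine_limited` are hypothetical. It only shows why a limited interpolant cannot overshoot the range of the coarse data.

```python
def minmod(a, b):
    """Return the smaller-magnitude slope if a and b share a sign, else 0."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def refine_limited(coarse):
    """Rebuild two fine cell values per coarse cell (1D, refinement ratio 2).

    Assumption: a minmod-limited linear reconstruction stands in for the
    higher-order LENO interpolation described in the abstract.
    """
    n = len(coarse)
    fine = []
    for i in range(n):
        # One-sided differences; treated as zero at the domain boundaries,
        # which makes the limited slope zero in the first and last cells.
        left = coarse[i] - coarse[i - 1] if i > 0 else 0.0
        right = coarse[i + 1] - coarse[i] if i < n - 1 else 0.0
        slope = minmod(left, right)
        # Fine cell centers sit a quarter of a coarse cell width on either
        # side of the coarse cell center.
        fine.append(coarse[i] - 0.25 * slope)
        fine.append(coarse[i] + 0.25 * slope)
    return fine


if __name__ == "__main__":
    coarse = [0.0, 0.0, 1.0, 1.0]  # a step profile that must stay in [0, 1]
    print(refine_limited(coarse))
```

Because the limited slope never exceeds either neighboring difference, each rebuilt fine value stays between the coarse value and its neighbors, and the two fine values in a cell average back to the coarse value. A higher-order limited scheme such as the one the abstract describes aims to keep this property while improving on the accuracy of linear interpolation.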
Sahasrabudhe, Damodar, et al. "Node failure resiliency for Uintah without checkpointing." Concurrency and Computation. Practice and Experience, vol. 31, no. 20, Jun. 2019. https://doi.org/10.1002/cpe.5340
Sahasrabudhe, Damodar, Berzins, Martin, & Schmidt, John (2019). Node failure resiliency for Uintah without checkpointing. Concurrency and Computation. Practice and Experience, 31(20). https://doi.org/10.1002/cpe.5340
Sahasrabudhe, Damodar, Berzins, Martin, and Schmidt, John, "Node failure resiliency for Uintah without checkpointing," Concurrency and Computation. Practice and Experience 31, no. 20 (2019), https://doi.org/10.1002/cpe.5340
@article{osti_1637354,
author = {Sahasrabudhe, Damodar and Berzins, Martin and Schmidt, John},
title = {Node failure resiliency for Uintah without checkpointing},
annote = {The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-Based Fault Tolerance (ABFT) using Adaptive Mesh Refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework, both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault-tolerant MPI implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "Limited Essentially Non-Oscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault-tolerant MPI implementation, User Level Failure Mitigation (ULFM), to recover from runtime failure, and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10x faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.},
doi = {10.1002/cpe.5340},
url = {https://www.osti.gov/biblio/1637354},
journal = {Concurrency and Computation. Practice and Experience},
issn = {1532-0626},
number = {20},
volume = {31},
place = {United States},
publisher = {Wiley},
year = {2019},
month = {06}}
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation
Grant/Contract Number:
NA0002375
OSTI ID:
1637354
Journal Information:
Journal Name: Concurrency and Computation. Practice and Experience; Journal Volume: 31; Journal Issue: 20; ISSN 1532-0626