Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
Benacchio, Tommaso, et al. "Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction." International Journal of High Performance Computing Applications, vol. 35, no. 4, Feb. 2021. https://doi.org/10.1177/1094342021990433
Benacchio, Tommaso, Bonaventura, Luca, Altenbernd, Mirco, Cantwell, Chris D., Düben, Peter D., Gillard, Mike, Giraud, Luc, Göddeke, Dominik, Raffin, Erwan, Teranishi, Keita, & Wedi, Nils (2021). Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction. International Journal of High Performance Computing Applications, 35(4). https://doi.org/10.1177/1094342021990433
Benacchio, Tommaso, Bonaventura, Luca, Altenbernd, Mirco, et al., "Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction," International Journal of High Performance Computing Applications 35, no. 4 (2021), https://doi.org/10.1177/1094342021990433
@article{osti_1770801,
author = {Benacchio, Tommaso and Bonaventura, Luca and Altenbernd, Mirco and Cantwell, Chris D. and Düben, Peter D. and Gillard, Mike and Giraud, Luc and Göddeke, Dominik and Raffin, Erwan and Teranishi, Keita and Wedi, Nils},
title = {Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction},
doi = {10.1177/1094342021990433},
url = {https://www.osti.gov/biblio/1770801},
journal = {International Journal of High Performance Computing Applications},
issn = {1094-3420},
number = {4},
volume = {35},
place = {United States},
publisher = {SAGE},
year = {2021},
month = {02}}
Research Organization:
Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); European Research Council (ERC); German Research Foundation (DFG)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1770801
Report Number(s):
SAND--2021-1552J; 694544
Journal Information:
Journal Name: International Journal of High Performance Computing Applications; Journal Volume: 35; Journal Issue: 4; ISSN 1094-3420