U.S. Department of Energy
Office of Scientific and Technical Information

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Journal Article · International Journal of High Performance Computing Applications
 [1];  [1];  [2];  [3];  [4];  [5];  [6];  [2];  [7];  [8];  [9]
  1. Politecnico di Milano (Italy)
  2. Univ. of Stuttgart (Germany)
  3. Imperial College, London (United Kingdom)
  4. European Centre for Medium-Range Weather Forecasts, Reading (United Kingdom); Univ. of Oxford (United Kingdom)
  5. Loughborough Univ. (United Kingdom)
  6. HiePACS, Talence (France)
  7. Center for Excellence in Performance Programming (CEPP), Rennes (France)
  8. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  9. European Centre for Medium-Range Weather Forecasts, Reading (United Kingdom)

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.

Research Organization:
Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); European Research Council (ERC); German Research Foundation (DFG)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1770801
Report Number(s):
SAND--2021-1552J; 694544
Journal Information:
International Journal of High Performance Computing Applications, Vol. 35, Issue 4; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
