U.S. Department of Energy
Office of Scientific and Technical Information

Report on local data recovery approaches suitable for weather and climate prediction (Deliverable 1.3) (V.1.0)

Technical Report · DOI: https://doi.org/10.2172/1607968 · OSTI ID: 1607968
 [1];  [1];  [2];  [3];  [4];  [5];  [6];  [2];  [7];  [8];  [9]
  1. Politecnico di Milano (Italy)
  2. Univ. of Stuttgart (Germany)
  3. Imperial College, London (United Kingdom)
  4. European Centre for Medium-Range Weather Forecasts, Reading (United Kingdom); Univ. of Oxford (United Kingdom)
  5. Loughborough Univ. (United Kingdom)
  6. HiePACS, Talence (France)
  7. Bull (ATOS), Bezons (France)
  8. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
  9. European Centre for Medium-Range Weather Forecasts, Reading (United Kingdom)

Numerical weather and climate prediction ranks among the scientific applications whose accuracy improvements depend greatly on growth in available computing power. As the number of cores in top computing facilities pushes into the millions, the increasing average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys approaches to fault tolerance in numerical algorithms and system resilience in parallel simulations from the perspective of numerical weather and climate prediction systems. A selection of existing strategies is analyzed: interpolation-restart and compressed checkpointing for the numerics, and in-memory checkpointing, user-level failure mitigation (ULFM)-based methods, and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to the current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between the performance, efficiency, and effectiveness of resiliency strategies are analyzed, and some recommendations are outlined for future developments.

Research Organization:
Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); European Union Horizon 2020; Deutsche Forschungsgemeinschaft
DOE Contract Number:
AC04-94AL85000; NA0003525
OSTI ID:
1607968
Report Number(s):
SAND--2020-3622R; 685048
Country of Publication:
United States
Language:
English

Similar Records

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
Journal Article · Sun Feb 07 23:00:00 EST 2021 · International Journal of High Performance Computing Applications · OSTI ID:1770801

Redundant computing for exascale systems.
Technical Report · Tue Nov 30 23:00:00 EST 2010 · OSTI ID:1011662

Implementing Software Resiliency in HPX for Extreme Scale Computing
Technical Report · Wed Apr 15 00:00:00 EDT 2020 · OSTI ID:1614897
