U.S. Department of Energy
Office of Scientific and Technical Information

Resiliency in numerical algorithm design for extreme scale simulations

Journal Article · · International Journal of High Performance Computing Applications
 [1]; [2]; [3]; [4]; [5]; [5]; [6]; [7]; [8]; [9]; [6]; [10]; [11]; [12]; [1]; [2]; [10]; [13]; [10]; [14]; [15]; [2]; [16]; [6]; [17]; [18]; [19]; [6]; [20]; [21]; [20]; [22]; [15]; [10]; [6]; [6]
  1. National Institute for Research in Digital Science and Technology (Inria), Rocquencourt (France)
  2. Univ. of Stuttgart (Germany)
  3. Karlsruher Institute of Technology (Germany)
  4. Barcelona Supercomputing Center (Spain)
  5. Polytechnic Univ. of Milan (Italy)
  6. Technical Univ. of Munich (Germany)
  7. NVIDIA Corporation, Santa Clara, CA (United States)
  8. Univ. of Basel (Switzerland)
  9. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  10. Univ. of Erlangen, Nuremberg (Germany)
  11. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  12. Univ. of Vienna (Austria)
  13. Paris-Pantheon-Assas Univ., Paris (France)
  14. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  15. Univ. of Bordeaux (France)
  16. Cerfacs, Toulouse (France)
  17. Polytechnic Univ. of Valencia (UPV) (Spain)
  18. NexGen Analytics, Sheridan, WY (United States)
  19. Univ. of Erlangen, Nuremberg (Germany); Cerfacs, Toulouse (France)
  20. Australian National Univ., Canberra, ACT (Australia)
  21. Forschungszentrum Jülich GmbH (Germany)
  22. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds, achieved at the expense of enormous resources and cost. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume about a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress unless robust alternatives are investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
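The energy figures quoted in the abstract can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the sustained rate of 10^18 floating-point operations per second and the electricity price of 0.10 Euro per kWh are assumptions chosen to be consistent with the abstract's rounded numbers, not values stated in the record, and the function name `run_estimate` is hypothetical.

```python
# Back-of-the-envelope check of the abstract's energy and cost figures.
# Values from the abstract: 48 h runtime, 20 MW power draw.
HOURS = 48
POWER_MW = 20

# Assumptions (not stated in the record):
ASSUMED_FLOPS_PER_SECOND = 1e18  # sustained exascale rate
ASSUMED_EUR_PER_KWH = 0.10       # price implied by "about 100k Euro"


def run_estimate(hours, power_mw, flops_per_second, eur_per_kwh):
    """Return (energy in kWh, cost in Euro, total floating-point operations)."""
    energy_kwh = hours * power_mw * 1000.0        # MW -> kW, times hours
    cost_eur = energy_kwh * eur_per_kwh
    total_flops = flops_per_second * hours * 3600.0  # hours -> seconds
    return energy_kwh, cost_eur, total_flops


energy_kwh, cost_eur, total_flops = run_estimate(
    HOURS, POWER_MW, ASSUMED_FLOPS_PER_SECOND, ASSUMED_EUR_PER_KWH
)
print(f"energy: {energy_kwh:.3g} kWh")
print(f"cost:   {cost_eur:.3g} Euro")
print(f"work:   {total_flops:.3g} floating-point operations")
```

Under these assumptions the run uses 960,000 kWh, costs roughly 96,000 Euro, and performs on the order of 10^23 operations, which matches the rounded figures in the abstract.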
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1855669
Journal Information:
International Journal of High Performance Computing Applications, Vol. 36, Issue 2; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English

References (203)

Parallel adaptive FETI‐DP using lightweight asynchronous dynamic load balancing
  • Klawonn, Axel; Kühn, Martin J.; Rheinbach, Oliver
  • International Journal for Numerical Methods in Engineering, Vol. 121, Issue 4 https://doi.org/10.1002/nme.6237
journal October 2019
Distributed asynchronous computation of fixed points journal September 1983
Multivariate Quadrature on Adaptive Sparse Grids journal August 2003
Chaotic relaxation journal April 1969
Algorithm-based fault tolerance applied to high performance computing journal April 2009
ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability journal January 2012
The Lanczos and conjugate gradient algorithms in finite precision arithmetic journal May 2006
Berkeley lab checkpoint/restart (BLCR) for Linux clusters journal September 2006
Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
  • Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) https://doi.org/10.1109/dsn.2014.78
conference June 2014
Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes conference November 2019
Evaluating the Impact of SDC on the GMRES Iterative Solver
  • Elliott, James; Hoemmen, Mark; Mueller, Frank
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium https://doi.org/10.1109/ipdps.2014.123
conference May 2014
Dynamic load balancing and efficient load estimators for asynchronous iterative algorithms journal April 2005
On Soft Errors in the Conjugate Gradient Method: Sensitivity and Robust Numerical Detection journal January 2020
Regression with the optimised combination technique conference January 2006
Fault Tolerance in the Parareal Method
  • Nielsen, Allan S.; Hesthaven, Jan S.
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale https://doi.org/10.1145/2909428.2909431
conference May 2016
PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures
  • Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17 https://doi.org/10.1145/3126908.3126943
conference January 2017
Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System report April 2010
A dimension adaptive sparse grid combination technique for machine learning journal April 2007
A Pattern Language for High-Performance Computing Resilience text January 2017
rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks preprint January 2019
Algorithm-Based Fault Tolerance for Parallel Stencil Computations preprint January 2019
CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications: CPPC: COMPILER-ASSISTED PORTABLE CHECKPOINTING
  • Rodríguez, Gabriel; Martín, María J.; González, Patricia
  • Concurrency and Computation: Practice and Experience, Vol. 22, Issue 6 https://doi.org/10.1002/cpe.1541
journal November 2009
Numerical recovery strategies for parallel resilient Krylov linear solvers: RESILIENCY IN KRYLOV LINEAR SOLVERS journal August 2016
A method of finite element tearing and interconnecting and its parallel solution algorithm journal October 1991
A semi-implicit, semi-Lagrangian discontinuous Galerkin framework for adaptive numerical weather prediction: SISL-DG Framework for Adaptive NWP journal May 2015
Asynchronous Iterative Algorithms with Flexible Communication for Nonlinear Network Flow Problems journal October 1996
Multirate linear multistep methods journal December 1984
Distributed asynchronous computation of fixed points journal September 1983
Asynchronous optimized Schwarz methods with and without overlap journal March 2017
Dimension?Adaptive Tensor?Product Quadrature journal August 2003
A multirate time stepping strategy for stiff ordinary differential equations journal November 2006
A conservative implicit multirate method for hyperbolic problems journal August 2018
A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers journal July 2018
On asynchronous iterations journal November 2000
Parallel asynchronous algorithms: A survey journal November 2020
The GeoClaw software for depth-averaged flows with adaptive refinement journal September 2011
A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes journal December 2017
A self adjusting multirate algorithm for robust time discretization of partial differential equations journal April 2020
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing journal October 2015
Local rollback for resilient MPI applications with application-level checkpointing and message logging journal February 2019
Comparison between adaptive and uniform discontinuous Galerkin simulations in dry 2D bubble experiments journal February 2013
An efficient parallel implementation of explicit multirate Runge–Kutta schemes for discontinuous Galerkin computations journal January 2014
Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation journal November 2013
Fine-grained bit-flip protection for relaxation methods journal September 2019
Large-scale simulation of mantle convection based on a new matrix-free approach journal February 2019
Symmetric active/active metadata service for high availability parallel file systems journal December 2009
Proactive process-level live migration and back migration in HPC environments journal February 2012
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns journal December 2014
Fault tolerant communication-optimal 2.5D matrix multiplication journal June 2017
Fault-tolerant least squares solvers for wireless sensor networks based on gossiping journal February 2020
Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing journal November 2015
Toward fault-tolerant parallel-in-time integration with PFASST journal February 2017
Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner journal April 2018
Performance of asynchronous optimized Schwarz with one-sided communication journal August 2019
Fault Tolerance Properties of Gossip-Based Distributed Orthogonal Iteration Methods journal January 2013
Sparse grids journal May 2004
The Lanczos and conjugate gradient algorithms in finite precision arithmetic journal May 2006
Tsunami modelling with adaptively refined finite volume methods journal April 2011
Discrete Stochastic Arithmetic for Validating Results of Numerical Software journal December 2004
Stochastic subspace correction methods and fault tolerance journal August 2019
Anisotropic mesh adaptivity for multi-scale ocean modelling
  • Piggott, M. D.; Farrell, P. E.; Wilson, C. R.
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 367, Issue 1907 https://doi.org/10.1098/rsta.2009.0155
journal November 2009
Error detection by duplicated instructions in super-scalar processors journal March 2002
Verificarlo: Checking Floating Point Accuracy through Monte Carlo Arithmetic conference July 2016
Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets conference December 2018
Dynamic Malleability in Iterative MPI Applications conference May 2007
Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications conference May 2017
Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations conference May 2019
Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets conference May 2019
SWIFT: Software Implemented Fault Tolerance conference January 2005
A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets conference September 2016
An ABFT Scheme Based on Communication Characteristics conference September 2016
MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection conference September 2017
Algorithm-Based Fault Tolerance for Parallel Stencil Computations conference September 2019
Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression conference September 2020
Debugging and Optimization of HPC Programs with the Verrou Tool conference November 2019
FlipSphere: A Software-Based DRAM Error Detection and Correction Library for HPC
  • Fiala, David; Mueller, Frank; Ferreira, Kurt B.
  • 2016 IEEE/ACM 20th International Symposium on Distributed Simulation and Real Time Applications (DS-RT) https://doi.org/10.1109/DS-RT.2016.27
conference September 2016
A fault tolerant approach to microprocessor design conference January 2001
Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
  • Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) https://doi.org/10.1109/DSN.2014.78
conference June 2014
Does partial replication pay off?
  • Stearley, Jon; Ferreira, Kurt; Robinson, David
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) https://doi.org/10.1109/DSNW.2012.6264669
conference June 2012
ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
  • Lidman, Jacob; Quinlan, Daniel J.; Liao, Chunhua
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) https://doi.org/10.1109/DSNW.2012.6264672
conference June 2012
A scalable double in-memory checkpoint and restart scheme towards exascale
  • Zheng, Gengbin; Kale, Laxmikant V.
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) https://doi.org/10.1109/DSNW.2012.6264677
conference June 2012
Improving Application Resilience by Extending Error Correction with Contextual Information conference November 2018
Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods conference November 2018
Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes conference November 2019
From tasks graphs to asynchronous distributed checkpointing with local restart conference November 2020
Supporting highly-decoupled thread-level redundancy for parallel programs conference February 2008
rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks conference July 2019
A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique conference July 2015
An evaluation of lazy fault detection based on Adaptive Redundant Multithreading conference September 2014
The Open Community Runtime: A runtime system for extreme scale computing conference September 2016
On the Resilience of Parallel Sparse Hybrid Solvers conference December 2015
A SIMD-based software fault tolerance for ARM processors conference May 2017
Combining Partial Redundancy and Checkpointing for HPC conference June 2012
Hybrid Checkpointing for MPI Jobs in HPC Environments conference December 2010
Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems conference October 2013
Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques conference August 2016
A PIN-Based Dynamic Software Fault Injection System
  • Jin, Ang; Jiang, Jianhui; Hu, Jiawei
  • 2008 9th International Conference for Young Computer Scientists (ICYCS), 2008 The 9th International Conference for Young Computer Scientists https://doi.org/10.1109/ICYCS.2008.329
conference November 2008
A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance conference March 2007
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
  • Di, Sheng; Bouguerra, Mohamed Slim; Bautista-Gomez, Leonardo
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium https://doi.org/10.1109/IPDPS.2014.122
conference May 2014
Evaluating the Impact of SDC on the GMRES Iterative Solver
  • Elliott, James; Hoemmen, Mark; Mueller, Frank
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium https://doi.org/10.1109/IPDPS.2014.123
conference May 2014
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
  • Guan, Qiang; Debardeleben, Nathan; Blanchard, Sean
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium https://doi.org/10.1109/IPDPS.2014.128
conference May 2014
Fast Error-Bounded Lossy HPC Data Compression with SZ conference May 2016
Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors conference May 2016
Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization conference May 2017
Asynchronous Multigrid Methods conference May 2019
VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale conference May 2019
Transient-fault recovery using simultaneous multithreading conference January 2002
SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation conference April 2017
Investigating the Resilience of Dynamic Loop Scheduling in Heterogeneous Computing Systems conference June 2015
DIVA: a reliable substrate for deep submicron microarchitecture design
  • Austin, T. M.
  • MICRO-32. 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture https://doi.org/10.1109/MICRO.1999.809458
conference January 1999
Recent Advances and New Avenues in Hardware-Level Reliability Support journal November 2005
RAJA: Portable Performance for Large-Scale Scientific Applications conference November 2019
Proactive Fault Tolerance Using Preemptive Migration
  • Engelmann, Christian; Vallee, Geoffroy R.; Naughton, Thomas
  • 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing https://doi.org/10.1109/PDP.2009.31
conference February 2009
NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart
  • Subasi, Omer; Arias, Javier; Unsal, Osman
  • 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing https://doi.org/10.1109/PDP.2015.17
conference March 2015
Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers conference January 2005
Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI conference November 2006
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2010.18
conference November 2010
Design and modeling of a non-blocking checkpointing system
  • Sato, Kento; Maruyama, Naoya; Mohror, Kathryn
  • 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2012.46
conference November 2012
Detection and correction of silent data corruption for large-scale high-performance computing
  • Fiala, David; Mueller, Frank; Engelmann, Christian
  • 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2012.49
conference November 2012
Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
  • Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2014.78
conference November 2014
FlipBack: Automatic Targeted Protection against Silent Data Corruption conference November 2016
Programming Models and Development Software for a Space-Based Many-Core Processor
  • Crago, Stephen P.; Kang, Dong-In; Kang, Mikyung
  • 2011 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology https://doi.org/10.1109/SMC-IT.2011.29
conference August 2011
Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors
  • Scholzel, Mario
  • 2007 Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA 2007), Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007 https://doi.org/10.1109/SPA.2007.5903294
conference September 2007
Performance Scaling Variability and Energy Analysis for a Resilient ULFM-based PDE Solver conference November 2016
Algorithm-Based Fault Tolerance for Matrix Operations journal June 1984
An algorithm-based error detection scheme for the multigrid method journal September 2003
Basic concepts and taxonomy of dependable and secure computing journal January 2004
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications journal October 2016
Toward General Software Level Silent Data Corruption Detection for Parallel Applications journal December 2017
An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL journal April 2018
CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance journal March 2019
Algorithm-Based Fault Tolerance for Convolutional Neural Networks journal January 2021
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment journal January 2008
Fully Adaptive Multigrid Methods journal February 1993
A Posteriori Error Estimates Based on Hierarchical Bases journal August 1993
Density Estimation with Adaptive Sparse Grids for Large Data Sets conference April 2014
Fault Tolerant Computation with the Sparse Grid Combination Technique journal January 2015
Fine-Grained Parallel Incomplete LU Factorization journal January 2015
Resilience for Massively Parallel Multigrid Solvers journal January 2016
Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids journal January 2016
Interpolation-Restart Strategies for Resilient Eigensolvers journal January 2016
Discrete A Priori Bounds for the Detection of Corrupted PDE Solutions in Exascale Computations journal January 2017
Is the Multigrid Method Fault Tolerant? The Multilevel Case journal January 2017
On the Analysis of Block Smoothers for Saddle Point Problems journal January 2018
Extreme-Scale Block-Structured Adaptive Mesh Refinement journal January 2018
A Stencil Scaling Approach for Accelerating Matrix-Free Finite Element Implementations journal January 2018
A Class of Multirate Infinitesimal GARK Methods journal January 2019
Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals journal January 2000
OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES journal June 2011
Scalable, fault tolerant membership for MPI tasks on HPC systems conference January 2006
Proactive fault tolerance for HPC with Xen virtualization conference January 2007
CHARM++: a portable concurrent object oriented system based on C++
  • Kale, Laxmikant V.; Krishnan, Sanjeev
  • Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications - OOPSLA '93 https://doi.org/10.1145/165854.165874
conference January 1993
Characterizing the impact of soft errors on iterative methods in scientific computing conference January 2011
Algorithms and data structures for massively parallel generic adaptive finite element codes journal December 2011
FTI: high performance fault tolerance interface for hybrid systems
  • Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11 https://doi.org/10.1145/2063384.2063427
conference January 2011
Robust distributed orthogonalization based on randomized aggregation
  • Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana
  • Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11 https://doi.org/10.1145/2133173.2133177
conference January 2011
Algorithm-based fault tolerance for dense matrix factorizations journal September 2012
Parallel reduction to hessenberg form with algorithm-based fault tolerance
  • Jia, Yulu; Bosilca, George; Luszczek, Piotr
  • SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2503210.2503249
conference November 2013
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
  • Ropars, Thomas; Martsinkevich, Tatiana V.; Guermouche, Amina
  • SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2503210.2503271
conference November 2013
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods journal August 2013
Self-stabilizing iterative solvers conference January 2013
FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines conference January 2014
Toward Local Failure Local Recovery Resilience Model using MPI-ULFM conference January 2014
MCALIB: Measuring Sensitivity to Rounding Error with Monte Carlo Programming journal April 2015
HPX: A Task Based Programming Model in a Global Address Space
  • Kaiser, Hartmut; Heller, Thomas; Adelstein-Lelbach, Bryce
  • Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14 https://doi.org/10.1145/2676870.2676883
conference January 2014
Programmer-directed partial redundancy for resilient HPC
  • Subasi, Omer; Arias, Javier; Unsal, Osman
  • CF'15: Computing Frontiers Conference, Proceedings of the 12th ACM International Conference on Computing Frontiers https://doi.org/10.1145/2742854.2742903
conference May 2015
Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices
  • Austin, Brian; Roman, Eric; Li, Xiaoye
  • HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale https://doi.org/10.1145/2751504.2751507
conference June 2015
Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
  • Jaulmes, Luc; Casas, Marc; Moretó, Miquel
  • SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2807591.2807599
conference November 2015
VOCL-FT: introducing techniques for efficient soft error coprocessor recovery
  • Peña, Antonio J.; Bland, Wesley; Balaji, Pavan
  • SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2807591.2807640
conference November 2015
Tuning stationary iterative solvers for fault resilience
  • Anzt, Hartwig; Dongarra, Jack; Quintana-Ortí, Enrique S.
  • Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '15 https://doi.org/10.1145/2832080.2832081
conference January 2015
New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
  • Tao, Dingwen; Song, Shuaiwen Leon; Krishnamoorthy, Sriram
  • Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16 https://doi.org/10.1145/2907294.2907306
conference January 2016
Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra
  • Wu, Panruo; Guan, Qiang; DeBardeleben, Nathan
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing https://doi.org/10.1145/2907294.2907315
conference May 2016
ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner
  • Rizzi, Francesco; Morris, Karla; Sargsyan, Khachik
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale https://doi.org/10.1145/2909428.2909429
conference May 2016
Mini-Ckpts: Surviving OS Failures in Persistent Memory
  • Fiala, David; Mueller, Frank; Ferreira, Kurt
  • ICS '16: 2016 International Conference on Supercomputing, Proceedings of the 2016 International Conference on Supercomputing https://doi.org/10.1145/2925426.2926295
conference June 2016
Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale conference January 2017
Correcting soft errors online in fast Fourier transform
  • Liang, Xin; Chen, Zizhong; Chen, Jieyang
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17 https://doi.org/10.1145/3126908.3126915
conference January 2017
REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed
  • Georgakoudis, Giorgis; Laguna, Ignacio; Nikolopoulos, Dimitrios S.
  • SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/3126908.3126972
conference November 2017
A Pattern Language for High-Performance Computing Resilience
  • Hukerikar, Saurabh; Engelmann, Christian
  • EuroPLoP '17: European Conference on Pattern Languages of Programs, Proceedings of the 22nd European Conference on Pattern Languages of Programs https://doi.org/10.1145/3147704.3147718
conference July 2017
A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations
  • Obersteiner, Michael; Hinojosa, Alfredo Parra; Heene, Mario
  • SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems https://doi.org/10.1145/3148226.3148229
conference November 2017
Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing
  • Ashraf, Rizwan A.; Hukerikar, Saurabh; Engelmann, Christian
  • ICPE '18: ACM/SPEC International Conference on Performance Engineering, Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering https://doi.org/10.1145/3184407.3184421
conference March 2018
DisCVar: discovering critical variables using algorithmic differentiation for transient faults journal March 2018
Improving performance of iterative methods by lossy checkpointing conference January 2018
Asynchronous Iterative Methods for Multiprocessors journal April 1978
Evaluating Support for OpenMP Offload Features conference January 2018
Towards resilient EU HPC systems: a blueprint conference April 2019
How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures
  • Pachajoa, Carlos; Levonyak, Markus; Gansterer, Wilfried N.
  • ICPP 2019: 48th International Conference on Parallel Processing, Proceedings of the 48th International Conference on Parallel Processing https://doi.org/10.1145/3337821.3337849
conference August 2019
Self-stabilizing systems in spite of distributed control journal November 1974
A survey of rollback-recovery protocols in message-passing systems journal September 2002
Toward Exascale Resilience journal September 2009
Silent error detection in numerical time-stepping schemes journal April 2014
Evaluating and extending user-level fault tolerance in MPI applications journal July 2016
Complex scientific applications made fault-tolerant with the sparse grid combination technique journal July 2016
Exploring versioned distributed arrays for resilience in scientific applications: global view resilience journal September 2016
Unified fault-tolerance framework for hybrid task-parallel message-passing applications journal September 2016
Soft fault detection and correction for multigrid journal February 2017
Algorithm-based fault recovery of adaptively refined parallel multilevel grids journal August 2017
Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions journal April 2018
A scalable and extensible checkpointing scheme for massively parallel simulations journal May 2018
Adaptive control in roll-forward recovery for extreme scale multigrid journal December 2018
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction journal February 2021
Achieving algorithmic resilience for temporal integration through spectral deferred corrections journal January 2017
Towards Textbook Efficiency for Parallel Multigrid journal February 2015
Methods of conjugate gradients for solving linear systems journal December 1952
