DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Resiliency in numerical algorithm design for extreme scale simulations

Abstract

This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
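
The energy and operation counts quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the run length (48 h) and power draw (20 MW) come from the abstract, while the sustained rate of 10^18 floating-point operations per second and the electricity price of 0.10 Euro per kWh are assumptions introduced here to show how the rounded figures might arise.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# From the abstract: 48 h run length, 20 MW power draw.
# Assumed here (not stated in the source): a sustained rate of
# 1e18 flop/s and an electricity price of 0.10 EUR per kWh.

RUNTIME_H = 48.0           # run length in hours
POWER_MW = 20.0            # system power draw in megawatts
FLOP_RATE = 1e18           # assumed sustained floating-point ops per second
PRICE_EUR_PER_KWH = 0.10   # assumed electricity price


def energy_kwh(runtime_h: float, power_mw: float) -> float:
    """Energy in kWh: hours times megawatts, converted to kilowatts."""
    return runtime_h * power_mw * 1000.0


def total_flop(runtime_h: float, flop_rate: float) -> float:
    """Total floating-point operations over the whole run."""
    return runtime_h * 3600.0 * flop_rate


energy = energy_kwh(RUNTIME_H, POWER_MW)   # close to a million kWh
cost = energy * PRICE_EUR_PER_KWH          # on the order of 100k EUR
flop = total_flop(RUNTIME_H, FLOP_RATE)    # on the order of 1e23

print(f"energy: {energy:.3g} kWh, cost: {cost:.3g} EUR, flop: {flop:.3g}")
```

Under these assumptions the run uses a little under a million kWh and performs somewhat more than 10^23 operations, which is consistent with the rounded values in the abstract.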

Authors:
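Agullo, Emmanuel [1]; Altenbernd, Mirco [2]; Anzt, Hartwig [3]; Bautista-Gomez, Leonardo [4]; Benacchio, Tommaso [5]; Bonaventura, Luca [5]; Bungartz, Hans-Joachim [6]; Chatterjee, Sanjay [7]; Ciorba, Florina M. [8]; DeBardeleben, Nathan [9]; Drzisga, Daniel [6]; Eibl, Sebastian [10]; Engelmann, Christian [11]; Gansterer, Wilfried N. [12]; Giraud, Luc [1]; Göddeke, Dominik [2]; Heisig, Marco [10]; Jézéquel, Fabienne [13]; Kohl, Nils [10]; Li, Xiaoye Sherry [14]; Lion, Romain [15]; Mehl, Miriam [2]; Mycek, Paul [16]; Obersteiner, Michael [6]; Quintana-Ortí, Enrique S. [17]; Rizzi, Francesco [18]; Rüde, Ulrich [19]; Schulz, Martin [6]; Fung, Fred [20]; Speck, Robert [21]; Stals, Linda [20]; Teranishi, Keita [22]; Thibault, Samuel [15]; Thönnes, Dominik [10]; Wagner, Andreas [6]; Wohlmuth, Barbara [6]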
  1. National Institute for Research in Digital Science and Technology (Inria), Rocquencourt (France)
  2. Univ. of Stuttgart (Germany)
  3. Karlsruhe Institute of Technology (Germany)
  4. Barcelona Supercomputing Center (Spain)
  5. Polytechnic Univ. of Milan (Italy)
  6. Technical Univ. of Munich (Germany)
  7. NVIDIA Corporation, Santa Clara, CA (United States)
  8. Univ. of Basel (Switzerland)
  9. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
  10. Univ. of Erlangen, Nuremberg (Germany)
  11. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  12. Univ. of Vienna (Austria)
  13. Paris-Pantheon-Assas Univ., Paris (France)
  14. Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
  15. Univ. of Bordeaux (France)
  16. Cerfacs, Toulouse (France)
  17. Polytechnic Univ. of Valencia (UPV) (Spain)
  18. NexGen Analytics, Sheridan, WY (United States)
  19. Univ. of Erlangen, Nuremberg (Germany); Cerfacs, Toulouse (France)
  20. Australian National Univ., Canberra, ACT (Australia)
  21. Forschungszentrum Jülich GmbH (Germany)
  22. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
2021-12-10
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1855669
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
International Journal of High Performance Computing Applications
Additional Journal Information:
Journal Volume: 36; Journal Issue: 2; Journal ID: ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
Subject:
79 ASTRONOMY AND ASTROPHYSICS; numerical algorithms; parallel computer architecture; fault tolerance; resilience

Citation Formats

Agullo, Emmanuel, Altenbernd, Mirco, Anzt, Hartwig, Bautista-Gomez, Leonardo, Benacchio, Tommaso, Bonaventura, Luca, Bungartz, Hans-Joachim, Chatterjee, Sanjay, Ciorba, Florina M., DeBardeleben, Nathan, Drzisga, Daniel, Eibl, Sebastian, Engelmann, Christian, Gansterer, Wilfried N., Giraud, Luc, Göddeke, Dominik, Heisig, Marco, Jézéquel, Fabienne, Kohl, Nils, Li, Xiaoye Sherry, Lion, Romain, Mehl, Miriam, Mycek, Paul, Obersteiner, Michael, Quintana-Ortí, Enrique S., Rizzi, Francesco, Rüde, Ulrich, Schulz, Martin, Fung, Fred, Speck, Robert, Stals, Linda, Teranishi, Keita, Thibault, Samuel, Thönnes, Dominik, Wagner, Andreas, and Wohlmuth, Barbara. Resiliency in numerical algorithm design for extreme scale simulations. United States: N. p., 2021. Web. doi:10.1177/10943420211055188.
Agullo, Emmanuel, Altenbernd, Mirco, Anzt, Hartwig, Bautista-Gomez, Leonardo, Benacchio, Tommaso, Bonaventura, Luca, Bungartz, Hans-Joachim, Chatterjee, Sanjay, Ciorba, Florina M., DeBardeleben, Nathan, Drzisga, Daniel, Eibl, Sebastian, Engelmann, Christian, Gansterer, Wilfried N., Giraud, Luc, Göddeke, Dominik, Heisig, Marco, Jézéquel, Fabienne, Kohl, Nils, Li, Xiaoye Sherry, Lion, Romain, Mehl, Miriam, Mycek, Paul, Obersteiner, Michael, Quintana-Ortí, Enrique S., Rizzi, Francesco, Rüde, Ulrich, Schulz, Martin, Fung, Fred, Speck, Robert, Stals, Linda, Teranishi, Keita, Thibault, Samuel, Thönnes, Dominik, Wagner, Andreas, & Wohlmuth, Barbara. Resiliency in numerical algorithm design for extreme scale simulations. United States. https://doi.org/10.1177/10943420211055188
Agullo, Emmanuel, Altenbernd, Mirco, Anzt, Hartwig, Bautista-Gomez, Leonardo, Benacchio, Tommaso, Bonaventura, Luca, Bungartz, Hans-Joachim, Chatterjee, Sanjay, Ciorba, Florina M., DeBardeleben, Nathan, Drzisga, Daniel, Eibl, Sebastian, Engelmann, Christian, Gansterer, Wilfried N., Giraud, Luc, Göddeke, Dominik, Heisig, Marco, Jézéquel, Fabienne, Kohl, Nils, Li, Xiaoye Sherry, Lion, Romain, Mehl, Miriam, Mycek, Paul, Obersteiner, Michael, Quintana-Ortí, Enrique S., Rizzi, Francesco, Rüde, Ulrich, Schulz, Martin, Fung, Fred, Speck, Robert, Stals, Linda, Teranishi, Keita, Thibault, Samuel, Thönnes, Dominik, Wagner, Andreas, and Wohlmuth, Barbara. 2021. "Resiliency in numerical algorithm design for extreme scale simulations". United States. https://doi.org/10.1177/10943420211055188. https://www.osti.gov/servlets/purl/1855669.
@article{osti_1855669,
title = {Resiliency in numerical algorithm design for extreme scale simulations},
author = {Agullo, Emmanuel and Altenbernd, Mirco and Anzt, Hartwig and Bautista-Gomez, Leonardo and Benacchio, Tommaso and Bonaventura, Luca and Bungartz, Hans-Joachim and Chatterjee, Sanjay and Ciorba, Florina M. and DeBardeleben, Nathan and Drzisga, Daniel and Eibl, Sebastian and Engelmann, Christian and Gansterer, Wilfried N. and Giraud, Luc and Göddeke, Dominik and Heisig, Marco and Jézéquel, Fabienne and Kohl, Nils and Li, Xiaoye Sherry and Lion, Romain and Mehl, Miriam and Mycek, Paul and Obersteiner, Michael and Quintana-Ortí, Enrique S. and Rizzi, Francesco and Rüde, Ulrich and Schulz, Martin and Fung, Fred and Speck, Robert and Stals, Linda and Teranishi, Keita and Thibault, Samuel and Thönnes, Dominik and Wagner, Andreas and Wohlmuth, Barbara},
abstractNote = {This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing $10^{23}$ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.},
doi = {10.1177/10943420211055188},
journal = {International Journal of High Performance Computing Applications},
number = 2,
volume = 36,
place = {United States},
year = {2021},
month = dec
}

Works referenced in this record:

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets
conference, May 2019

  • Keller, Kai; Bautista-Gomez, Leonardo
  • 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
  • DOI: 10.1109/CCGRID.2019.00015

Scalable, fault tolerant membership for MPI tasks on HPC systems
conference, January 2006

  • Varma, Jyothish; Wang, Chao; Mueller, Frank
  • Proceedings of the 20th annual international conference on Supercomputing - ICS '06
  • DOI: 10.1145/1183401.1183433

Toward fault-tolerant parallel-in-time integration with PFASST
journal, February 2017


Correcting soft errors online in fast fourier transform
conference, January 2017

  • Liang, Xin; Chen, Zizhong; Chen, Jieyang
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
  • DOI: 10.1145/3126908.3126915

A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations
conference, November 2017

  • Obersteiner, Michael; Hinojosa, Alfredo Parra; Heene, Mario
  • SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
  • DOI: 10.1145/3148226.3148229

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
conference, March 2007

  • Wang, Chao; Mueller, Frank; Engelmann, Christian
  • 2007 IEEE International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2007.370307

The Open Community Runtime: A runtime system for extreme scale computing
conference, September 2016

  • Mattson, Timothy G.; Cledat, Romain; Cave, Vincent
  • 2016 IEEE High Performance Extreme Computing Conference (HPEC)
  • DOI: 10.1109/HPEC.2016.7761580

A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique
conference, July 2015

  • Ali, Md Mohsin; Strazdins, Peter E.; Harding, Brendan
  • 2015 International Conference on High Performance Computing & Simulation (HPCS)
  • DOI: 10.1109/HPCSim.2015.7237082

ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability
journal, January 2012


Algorithm-based fault tolerance for dense matrix factorizations
journal, September 2012

  • Du, Peng; Bouteiller, Aurelien; Bosilca, George
  • ACM SIGPLAN Notices, Vol. 47, Issue 8
  • DOI: 10.1145/2370036.2145845

An evaluation of lazy fault detection based on Adaptive Redundant Multithreading
conference, September 2014

  • Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.
  • 2014 IEEE High Performance Extreme Computing Conference (HPEC)
  • DOI: 10.1109/HPEC.2014.7040999

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
conference, November 2015

  • Jaulmes, Luc; Casas, Marc; Moretó, Miquel
  • SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/2807591.2807599

Investigating the Resilience of Dynamic Loop Scheduling in Heterogeneous Computing Systems
conference, June 2015

  • Sukhija, Nitin; Banicescu, Ioana; Ciorba, Florina M.
  • 2015 14th International Symposium on Parallel and Distributed Computing (ISPDC)
  • DOI: 10.1109/ISPDC.2015.29

CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance
journal, March 2019

  • Shahzad, Faisal; Thies, Jonas; Kreutzer, Moritz
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 30, Issue 3
  • DOI: 10.1109/TPDS.2018.2866794

MCALIB: Measuring Sensitivity to Rounding Error with Monte Carlo Programming
journal, April 2015

  • Frechtling, Michael; Leong, Philip H. W.
  • ACM Transactions on Programming Languages and Systems, Vol. 37, Issue 2
  • DOI: 10.1145/2665073

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations
conference, May 2019

  • Cavelan, Aurelien; Cabezon, Ruben M.; Ciorba, Florina M.
  • 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
  • DOI: 10.1109/CCGRID.2019.00013

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2010.18

Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

  • Laguna, Ignacio; Richards, David F.; Gamblin, Todd
  • The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
  • DOI: 10.1177/1094342015623623

A multirate time stepping strategy for stiff ordinary differential equations
journal, November 2006


A dimension adaptive sparse grid combination technique for machine learning
journal, April 2007


A fault tolerant approach to microprocessor design
conference, January 2001

  • Weaver, C.; Austin, T.
  • Proceedings International Conference on Dependable Systems and Networks
  • DOI: 10.1109/DSN.2001.941425

Characterizing the impact of soft errors on iterative methods in scientific computing
conference, January 2011

  • Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
  • Proceedings of the international conference on Supercomputing - ICS '11
  • DOI: 10.1145/1995896.1995922

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
journal, February 2021

  • Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco
  • The International Journal of High Performance Computing Applications, Vol. 35, Issue 4
  • DOI: 10.1177/1094342021990433

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006


An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL
journal, April 2018

  • Tang, Xiongchao; Zhai, Jidong; Yu, Bowen
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 29, Issue 4
  • DOI: 10.1109/TPDS.2017.2781257

Algorithm-Based Fault Tolerance for Parallel Stencil Computations
conference, September 2019


Methods of conjugate gradients for solving linear systems
journal, December 1952

  • Hestenes, M. R.; Stiefel, E.
  • Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6
  • DOI: 10.6028/jres.049.044

Comparison between adaptive and uniform discontinuous Galerkin simulations in dry 2D bubble experiments
journal, February 2013

  • Müller, Andreas; Behrens, Jörn; Giraldo, Francis X.
  • Journal of Computational Physics, Vol. 235
  • DOI: 10.1016/j.jcp.2012.10.038

Fully Adaptive Multigrid Methods
journal, February 1993

  • Rüde, Ulrich
  • SIAM Journal on Numerical Analysis, Vol. 30, Issue 1
  • DOI: 10.1137/0730011

Tuning stationary iterative solvers for fault resilience
conference, January 2015

  • Anzt, Hartwig; Dongarra, Jack; Quintana-Ortí, Enrique S.
  • Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '15
  • DOI: 10.1145/2832080.2832081

A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes
journal, December 2017


A PIN-Based Dynamic Software Fault Injection System
conference, November 2008

  • Jin, Ang; Jiang, Jianhui; Hu, Jiawei
  • 2008 9th International Conference for Young Computer Scientists (ICYCS), 2008 The 9th International Conference for Young Computer Scientists
  • DOI: 10.1109/ICYCS.2008.329

Extreme-Scale Block-Structured Adaptive Mesh Refinement
journal, January 2018

  • Schornbaum, Florian; Rüde, Ulrich
  • SIAM Journal on Scientific Computing, Vol. 40, Issue 3
  • DOI: 10.1137/17M1128411

A Stencil Scaling Approach for Accelerating Matrix-Free Finite Element Implementations
journal, January 2018

  • Bauer, S.; Drzisga, D.; Mohr, M.
  • SIAM Journal on Scientific Computing, Vol. 40, Issue 6
  • DOI: 10.1137/17M1148384

Discrete Stochastic Arithmetic for Validating Results of Numerical Software
journal, December 2004


rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks
conference, July 2019

  • Mohammed, Ali; Cavelan, Aurelien; Ciorba, Florina M.
  • 2019 International Conference on High Performance Computing & Simulation (HPCS)
  • DOI: 10.1109/HPCS48598.2019.9188153

An efficient parallel implementation of explicit multirate Runge–Kutta schemes for discontinuous Galerkin computations
journal, January 2014

  • Seny, Bruno; Lambrechts, Jonathan; Toulorge, Thomas
  • Journal of Computational Physics, Vol. 256
  • DOI: 10.1016/j.jcp.2013.07.041

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

  • Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.78

Achieving algorithmic resilience for temporal integration through spectral deferred corrections
journal, January 2017

  • Grout, Ray; Kolla, Hemanth; Minion, Michael
  • Communications in Applied Mathematics and Computational Science, Vol. 12, Issue 1
  • DOI: 10.2140/camcos.2017.12.25

Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices
conference, June 2015

  • Austin, Brian; Roman, Eric; Li, Xiaoye
  • HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
  • DOI: 10.1145/2751504.2751507

PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures
conference, January 2017

  • Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
  • DOI: 10.1145/3126908.3126943

Algorithm-based fault recovery of adaptively refined parallel multilevel grids
journal, August 2017

  • Stals, Linda
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 1
  • DOI: 10.1177/1094342017720801

FTI: high performance fault tolerance interface for hybrid systems
conference, January 2011

  • Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
  • DOI: 10.1145/2063384.2063427

On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing
journal, October 2015


Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing
conference, March 2018

  • Ashraf, Rizwan A.; Hukerikar, Saurabh; Engelmann, Christian
  • ICPE '18: ACM/SPEC International Conference on Performance Engineering, Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering
  • DOI: 10.1145/3184407.3184421

Multivariate Quadrature on Adaptive Sparse Grids
journal, August 2003


Algorithms and data structures for massively parallel generic adaptive finite element codes
journal, December 2011

  • Bangerth, Wolfgang; Burstedde, Carsten; Heister, Timo
  • ACM Transactions on Mathematical Software, Vol. 38, Issue 2
  • DOI: 10.1145/2049673.2049678

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures
conference, August 2019

  • Pachajoa, Carlos; Levonyak, Markus; Gansterer, Wilfried N.
  • ICPP 2019: 48th International Conference on Parallel Processing, Proceedings of the 48th International Conference on Parallel Processing
  • DOI: 10.1145/3337821.3337849

A self adjusting multirate algorithm for robust time discretization of partial differential equations
journal, April 2020

  • Bonaventura, L.; Casella, F.; Carciopolo, L. Delpopolo
  • Computers & Mathematics with Applications, Vol. 79, Issue 7
  • DOI: 10.1016/j.camwa.2019.11.023

Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors
conference, September 2007

  • Scholzel, Mario
  • 2007 Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA 2007), Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007
  • DOI: 10.1109/SPA.2007.5903294

Fault Tolerance in the Parareal Method
conference, May 2016

  • Nielsen, Allan S.; Hesthaven, Jan S.
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
  • DOI: 10.1145/2909428.2909431

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale
conference, May 2019

  • Nicolae, Bogdan; Moody, Adam; Gonsiorowski, Elsa
  • 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2019.00099

Toward Exascale Resilience
journal, September 2009

  • Cappello, Franck; Geist, Al; Gropp, Bill
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
  • DOI: 10.1177/1094342009347767

Multirate linear multistep methods
journal, December 1984


Complex scientific applications made fault-tolerant with the sparse grid combination technique
journal, July 2016

  • Ali, Md Mohsin; Strazdins, Peter E.; Harding, Brendan
  • The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
  • DOI: 10.1177/1094342015628056

FlipBack: Automatic Targeted Protection against Silent Data Corruption
conference, November 2016

  • Ni, Xiang; Kale, Laxmikant V.
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2016.28

Robust distributed orthogonalization based on randomized aggregation
conference, January 2011

  • Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana
  • Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11
  • DOI: 10.1145/2133173.2133177

Resilience for Massively Parallel Multigrid Solvers
journal, January 2016

  • Huber, Markus; Gmeiner, Björn; Rüde, Ulrich
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1026122

Parallel adaptive FETI‐DP using lightweight asynchronous dynamic load balancing
journal, October 2019

  • Klawonn, Axel; Kühn, Martin J.; Rheinbach, Oliver
  • International Journal for Numerical Methods in Engineering, Vol. 121, Issue 4
  • DOI: 10.1002/nme.6237

Fault tolerant communication-optimal 2.5D matrix multiplication
journal, June 2017

  • Moldaschl, Michael; Prikopa, Karl E.; Gansterer, Wilfried N.
  • Journal of Parallel and Distributed Computing, Vol. 104
  • DOI: 10.1016/j.jpdc.2017.01.022

On asynchronous iterations
journal, November 2000


Programming Models and Development Software for a Space-Based Many-Core Processor
conference, August 2011

  • Crago, Stephen P.; Kang, Dong-In; Kang, Mikyung
  • 2011 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology
  • DOI: 10.1109/SMC-IT.2011.29

Soft fault detection and correction for multigrid
journal, February 2017

  • Altenbernd, Mirco; Göddeke, Dominik
  • The International Journal of High Performance Computing Applications, Vol. 32, Issue 6
  • DOI: 10.1177/1094342016684006

Evaluating Support for OpenMP Offload Features
conference, January 2018

  • Diaz, Jose Monsalve; Pophale, Swaroop; Friedline, Kyle
  • Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18
  • DOI: 10.1145/3229710.3229717

Anisotropic mesh adaptivity for multi-scale ocean modelling
journal, November 2009

  • Piggott, M. D.; Farrell, P. E.; Wilson, C. R.
  • Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 367, Issue 1907
  • DOI: 10.1098/rsta.2009.0155

Fine-Grained Parallel Incomplete LU Factorization
journal, January 2015

  • Chow, Edmond; Patel, Aftab
  • SIAM Journal on Scientific Computing, Vol. 37, Issue 2
  • DOI: 10.1137/140968896

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
journal, January 2008

  • Langou, J.; Chen, Z.; Bosilca, G.
  • SIAM Journal on Scientific Computing, Vol. 30, Issue 1
  • DOI: 10.1137/040620394

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
conference, August 2016

  • Gamell, Marc; Katz, Daniel S.; Teranishi, Keita
  • 2016 45th International Conference on Parallel Processing Workshops (ICPPW)
  • DOI: 10.1109/ICPPW.2016.56

ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner
conference, May 2016

  • Rizzi, Francesco; Morris, Karla; Sargsyan, Khachik
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
  • DOI: 10.1145/2909428.2909429

Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions
journal, April 2018

  • Casas, Marc; Gansterer, Wilfried N.; Wimmer, Elias
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 2
  • DOI: 10.1177/1094342018762531

Unified fault-tolerance framework for hybrid task-parallel message-passing applications
journal, September 2016

  • Subasi, Omer; Martsinkevich, Tatiana; Zyulkyarov, Ferad
  • The International Journal of High Performance Computing Applications, Vol. 32, Issue 5
  • DOI: 10.1177/1094342016669416

A method of finite element tearing and interconnecting and its parallel solution algorithm
journal, October 1991

  • Farhat, Charbel; Roux, Francois-Xavier
  • International Journal for Numerical Methods in Engineering, Vol. 32, Issue 6
  • DOI: 10.1002/nme.1620320604

REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed
conference, November 2017

  • Georgakoudis, Giorgis; Laguna, Ignacio; Nikolopoulos, Dimitrios S.
  • SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/3126908.3126972

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
conference, May 2014

  • Di, Sheng; Bouguerra, Mohamed Slim; Bautista-Gomez, Leonardo
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2014.122

Discrete A Priori Bounds for the Detection of Corrupted PDE Solutions in Exascale Computations
journal, January 2017

  • Mycek, Paul; Rizzi, Francesco; Maître, Olivier Le
  • SIAM Journal on Scientific Computing, Vol. 39, Issue 1
  • DOI: 10.1137/15M1051786

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods
conference, November 2018

  • Pachajoa, Carlos; Levonyak, Markus; Gansterer, Wilfried N.
  • 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
  • DOI: 10.1109/FTXS.2018.00009

Numerical recovery strategies for parallel resilient Krylov linear solvers
journal, August 2016

  • Agullo, Emmanuel; Giraud, Luc; Guermouche, Abdou
  • Numerical Linear Algebra with Applications, Vol. 23, Issue 5
  • DOI: 10.1002/nla.2059

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
conference, October 2013

  • Engelmann, Christian; Naughton, Thomas
  • 2013 42nd International Conference on Parallel Processing (ICPP)
  • DOI: 10.1109/ICPP.2013.114

Debugging and Optimization of HPC Programs with the Verrou Tool
conference, November 2019

  • Fevotte, Francois; Lathuiliere, Bruno
  • 2019 IEEE/ACM 3rd International Workshop on Software Correctness for HPC Applications (Correctness)
  • DOI: 10.1109/Correctness49594.2019.00006

CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications
journal, November 2009

  • Rodríguez, Gabriel; Martín, María J.; González, Patricia
  • Concurrency and Computation: Practice and Experience, Vol. 22, Issue 6
  • DOI: 10.1002/cpe.1541

Local rollback for resilient MPI applications with application-level checkpointing and message logging
journal, February 2019


A SIMD-based software fault tolerance for ARM processors
conference, May 2017

  • Lin, Shun-Zhi; Chen, Peng-Sheng
  • 2017 International Conference on Applied System Innovation (ICASI)
  • DOI: 10.1109/ICASI.2017.7988587

Improving Application Resilience by Extending Error Correction with Contextual Information
conference, November 2018

  • Poulos, Alexandra; Wallace, Dylan; Robey, Robert
  • 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
  • DOI: 10.1109/FTXS.2018.00006

DIVA: a reliable substrate for deep submicron microarchitecture design
conference, January 1999

  • Austin, T. M.
  • MICRO-32. 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture
  • DOI: 10.1109/MICRO.1999.809458

Improving performance of iterative methods by lossy checkponting
conference, January 2018

  • Tao, Dingwen; Di, Sheng; Liang, Xin
  • Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
  • DOI: 10.1145/3208040.3208050

RAJA: Portable Performance for Large-Scale Scientific Applications
conference, November 2019

  • Beckingsale, David A.; Scogland, Thomas RW; Burmark, Jason
  • 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)
  • DOI: 10.1109/P3HPC49587.2019.00012

Fault-tolerant least squares solvers for wireless sensor networks based on gossiping
journal, February 2020

  • Prikopa, Karl E.; Gansterer, Wilfried N.
  • Journal of Parallel and Distributed Computing, Vol. 136
  • DOI: 10.1016/j.jpdc.2019.09.006

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
journal, October 2016

  • Di, Sheng; Cappello, Franck
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 27, Issue 10
  • DOI: 10.1109/TPDS.2016.2517639

Chaotic relaxation
journal, April 1969


VOCL-FT: introducing techniques for efficient soft error coprocessor recovery
conference, November 2015

  • Peña, Antonio J.; Bland, Wesley; Balaji, Pavan
  • SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/2807591.2807640

Toward General Software Level Silent Data Corruption Detection for Parallel Applications
journal, December 2017

  • Berrocal, Eduardo; Bautista-Gomez, Leonardo; Di, Sheng
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 12
  • DOI: 10.1109/TPDS.2017.2735971

Stochastic subspace correction methods and fault tolerance
journal, August 2019

  • Griebel, Michael; Oswald, Peter
  • Mathematics of Computation, Vol. 89, Issue 321
  • DOI: 10.1090/mcom/3459

Tsunami modelling with adaptively refined finite volume methods
journal, April 2011


Proactive fault tolerance for HPC with Xen virtualization
conference, January 2007

  • Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian
  • Proceedings of the 21st annual international conference on Supercomputing - ICS '07
  • DOI: 10.1145/1274971.1274978

Combining Partial Redundancy and Checkpointing for HPC
conference, June 2012

  • Elliott, James; Kharbas, Kishor; Fiala, David
  • 2012 IEEE 32nd International Conference on Distributed Computing Systems (ICDCS)
  • DOI: 10.1109/ICDCS.2012.56

Adaptive control in roll-forward recovery for extreme scale multigrid
journal, December 2018

  • Huber, Markus; Rüde, Ulrich; Wohlmuth, Barbara
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 5
  • DOI: 10.1177/1094342018817088

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
journal, December 2014

  • Edwards, H. Carter; Trott, Christian R.; Sunderland, Daniel
  • Journal of Parallel and Distributed Computing, Vol. 74, Issue 12
  • DOI: 10.1016/j.jpdc.2014.07.003

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation
conference, April 2017

  • Hari, Siva Kumar Sastry; Tsai, Timothy; Stephenson, Mark
  • 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
  • DOI: 10.1109/ISPASS.2017.7975296

Hybrid Checkpointing for MPI Jobs in HPC Environments
conference, December 2010

  • Wang, Chao; Mueller, Frank; Engelmann, Christian
  • 2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS)
  • DOI: 10.1109/ICPADS.2010.48

Exploring versioned distributed arrays for resilience in scientific applications: global view resilience
journal, September 2016

  • Chien, A.; Balaji, P.; Dun, N.
  • The International Journal of High Performance Computing Applications, Vol. 31, Issue 6
  • DOI: 10.1177/1094342016664796

Fine-grained bit-flip protection for relaxation methods
journal, September 2019

  • Anzt, Hartwig; Dongarra, Jack; Quintana-Ortí, Enrique S.
  • Journal of Computational Science, Vol. 36
  • DOI: 10.1016/j.jocs.2016.11.013

A scalable and extensible checkpointing scheme for massively parallel simulations
journal, May 2018

  • Kohl, Nils; Hötzer, Johannes; Schornbaum, Florian
  • The International Journal of High Performance Computing Applications, Vol. 33, Issue 4
  • DOI: 10.1177/1094342018767736

Asynchronous Iterative Methods for Multiprocessors
journal, April 1978


Error detection by duplicated instructions in super-scalar processors
journal, March 2002

  • Oh, N.; Shirvani, P. P.; McCluskey, E. J.
  • IEEE Transactions on Reliability, Vol. 51, Issue 1
  • DOI: 10.1109/24.994913

Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization
conference, May 2017

  • Tao, Dingwen; Di, Sheng; Chen, Zizhong
  • 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2017.115

Fault Tolerance Properties of Gossip-Based Distributed Orthogonal Iteration Methods
journal, January 2013


HPX: A Task Based Programming Model in a Global Address Space
conference, January 2014

  • Kaiser, Hartmut; Heller, Thomas; Adelstein-Lelbach, Bryce
  • Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14
  • DOI: 10.1145/2676870.2676883

Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation
journal, November 2013

  • Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana
  • Journal of Computational Science, Vol. 4, Issue 6
  • DOI: 10.1016/j.jocs.2013.01.006

Supporting highly-decoupled thread-level redundancy for parallel programs
conference, February 2008

  • Rashid, M. Wasiur; Huang, Michael C.
  • 2008 IEEE 14th International Symposium on High Performance Computer Architecture (HPCA)
  • DOI: 10.1109/HPCA.2008.4658655

Parallel reduction to hessenberg form with algorithm-based fault tolerance
conference, November 2013

  • Jia, Yulu; Bosilca, George; Luszczek, Piotr
  • SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/2503210.2503249

Asynchronous Iterative Algorithms with Flexible Communication for Nonlinear Network Flow Problems
journal, October 1996

  • El Baz, Didier; Spiteri, Pierre; Miellou, Jean Claude
  • Journal of Parallel and Distributed Computing, Vol. 38, Issue 1
  • DOI: 10.1006/jpdc.1996.0124

The GeoClaw software for depth-averaged flows with adaptive refinement
journal, September 2011


ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
conference, June 2012

  • Lidman, Jacob; Quinlan, Daniel J.; Liao, Chunhua
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
  • DOI: 10.1109/DSNW.2012.6264672

A scalable double in-memory checkpoint and restart scheme towards exascale
conference, June 2012

  • Zheng, Gengbin; Kale, Laxmikant V.
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
  • DOI: 10.1109/DSNW.2012.6264677

Regression with the optimised combination technique
conference, January 2006

  • Garcke, Jochen
  • Proceedings of the 23rd international conference on Machine learning - ICML '06
  • DOI: 10.1145/1143844.1143885

Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets
conference, December 2018


Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale
conference, January 2017

  • Benoit, Anne; Cavelan, Aurélien; Cappello, Franck
  • Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale - FTXS '17
  • DOI: 10.1145/3086157.3086162

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
conference, November 2006

  • Coti, Camille; Herault, Thomas; Lemarinier, Pierre
  • SC 2006 Proceedings Supercomputing 2006, ACM/IEEE SC 2006 Conference (SC'06)
  • DOI: 10.1109/SC.2006.15

Proactive process-level live migration and back migration in HPC environments
journal, February 2012

  • Wang, Chao; Mueller, Frank; Engelmann, Christian
  • Journal of Parallel and Distributed Computing, Vol. 72, Issue 2, p. 254-267
  • DOI: 10.1016/j.jpdc.2011.10.009

Density Estimation with Adaptive Sparse Grids for Large Data Sets
conference, April 2014

  • Peherstorfer, Benjamin; Pflüger, Dirk; Bungartz, Hans-Joachim
  • Proceedings of the 2014 SIAM International Conference on Data Mining
  • DOI: 10.1137/1.9781611973440.51

Programmer-directed partial redundancy for resilient HPC
conference, May 2015

  • Subasi, Omer; Arias, Javier; Unsal, Osman
  • CF'15: Computing Frontiers Conference, Proceedings of the 12th ACM International Conference on Computing Frontiers
  • DOI: 10.1145/2742854.2742903

On the Resilience of Parallel Sparse Hybrid Solvers
conference, December 2015

  • Agullo, Emmanuel; Giraud, Luc; Zounon, Mawussi
  • 2015 IEEE 22nd International Conference on High Performance Computing (HiPC)
  • DOI: 10.1109/HiPC.2015.9

Does partial replication pay off?
conference, June 2012

  • Stearley, Jon; Ferreira, Kurt; Robinson, David
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
  • DOI: 10.1109/DSNW.2012.6264669

Proactive Fault Tolerance Using Preemptive Migration
conference, February 2009

  • Engelmann, Christian; Vallee, Geoffroy R.; Naughton, Thomas
  • 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing
  • DOI: 10.1109/PDP.2009.31

Design and modeling of a non-blocking checkpointing system
conference, November 2012

  • Sato, Kento; Maruyama, Naoya; Mohror, Kathryn
  • 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2012.46

FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines
conference, January 2014

  • Wu, Panruo; Chen, Zizhong
  • Proceedings of the 23rd international symposium on High-performance parallel and distributed computing - HPDC '14
  • DOI: 10.1145/2600212.2600232

Detection and correction of silent data corruption for large-scale high-performance computing
conference, November 2012

  • Fiala, David; Mueller, Frank; Engelmann, Christian
  • 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2012.49

Dynamic Malleability in Iterative MPI Applications
conference, May 2007

  • El Maghraoui, Kaoutar; Desell, Travis J.; Szymanski, Boleslaw K.
  • Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
  • DOI: 10.1109/CCGRID.2007.45

A Pattern Language for High-Performance Computing Resilience
conference, July 2017

  • Hukerikar, Saurabh; Engelmann, Christian
  • EuroPLoP '17: European Conference on Pattern Languages of Programs, Proceedings of the 22nd European Conference on Pattern Languages of Programs
  • DOI: 10.1145/3147704.3147718

Asynchronous optimized Schwarz methods with and without overlap
journal, March 2017

  • Magoulès, Frédéric; Szyld, Daniel B.; Venet, Cédric
  • Numerische Mathematik, Vol. 137, Issue 1
  • DOI: 10.1007/s00211-017-0872-z

MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection
conference, September 2017

  • Subasi, Omer; Di, Sheng; Balaprakash, Prasanna
  • 2017 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2017.128

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
conference, May 2014

  • Guan, Qiang; Debardeleben, Nathan; Blanchard, Sean
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2014.128

An ABFT Scheme Based on Communication Characteristics
conference, September 2016

  • Kabir, Upama; Goswami, Dhrubajyoti
  • 2016 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2016.68

SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
conference, November 2013

  • Ropars, Thomas; Martsinkevich, Tatiana V.; Guermouche, Amina
  • SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1145/2503210.2503271

Towards resilient EU HPC systems: a blueprint
conference, April 2019

  • Radojkovic, Petar
  • CF '19: Computing Frontiers Conference, Proceedings of the 16th ACM International Conference on Computing Frontiers
  • DOI: 10.1145/3310273.3323434

A conservative implicit multirate method for hyperbolic problems
journal, August 2018

  • Delpopolo Carciopolo, Ludovica; Bonaventura, Luca; Scotti, Anna
  • Computational Geosciences, Vol. 23, Issue 4
  • DOI: 10.1007/s10596-018-9764-2

DisCVar: discovering critical variables using algorithmic differentiation for transient faults
journal, March 2018


Performance Scaling Variability and Energy Analysis for a Resilient ULFM-based PDE Solver
conference, November 2016

  • Morris, K.; Rizzi, F.; Cook, B.
  • 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
  • DOI: 10.1109/ScalA.2016.010

NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart
conference, March 2015

  • Subasi, Omer; Arias, Javier; Unsal, Osman
  • 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
  • DOI: 10.1109/PDP.2015.17

Is the Multigrid Method Fault Tolerant? The Multilevel Case
journal, January 2017

  • Ainsworth, Mark; Glusa, Christian
  • SIAM Journal on Scientific Computing, Vol. 39, Issue 6
  • DOI: 10.1137/16M1097274

A survey of rollback-recovery protocols in message-passing systems
journal, September 2002

  • Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min
  • ACM Computing Surveys, Vol. 34, Issue 3
  • DOI: 10.1145/568522.568525

Asynchronous Multigrid Methods
conference, May 2019

  • Wolfson-Pou, Jordi; Chow, Edmond
  • 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2019.00021

A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets
conference, September 2016

  • Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad
  • 2016 IEEE International Conference on Cluster Computing (CLUSTER)
  • DOI: 10.1109/CLUSTER.2016.54

Basic concepts and taxonomy of dependable and secure computing
journal, January 2004

  • Avizienis, A.; Laprie, J.-C.; Randell, B.
  • IEEE Transactions on Dependable and Secure Computing, Vol. 1, Issue 1
  • DOI: 10.1109/TDSC.2004.2

An algorithm-based error detection scheme for the multigrid method
journal, September 2003


Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression
conference, September 2020


Silent error detection in numerical time-stepping schemes
journal, April 2014

  • Benson, Austin R.; Schmit, Sven; Schreiber, Robert
  • The International Journal of High Performance Computing Applications, Vol. 29, Issue 4
  • DOI: 10.1177/1094342014532297

Transient-fault recovery using simultaneous multithreading
conference, January 2002

  • Vijaykumar, T. N.; Pomeranz, I.; Cheng, K.
  • Proceedings 29th Annual International Symposium on Computer Architecture
  • DOI: 10.1109/ISCA.2002.1003565

Self-stabilizing systems in spite of distributed control
journal, November 1974


From tasks graphs to asynchronous distributed checkpointing with local restart
conference, November 2020

  • Lion, Romain; Thibault, Samuel
  • 2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
  • DOI: 10.1109/FTXS51974.2020.00009

On the Analysis of Block Smoothers for Saddle Point Problems
journal, January 2018

  • Drzisga, Daniel; John, Lorenz; Rüde, Ulrich
  • SIAM Journal on Matrix Analysis and Applications, Vol. 39, Issue 2
  • DOI: 10.1137/16M1106304

Fault Tolerant Computation with the Sparse Grid Combination Technique
journal, January 2015

  • Harding, Brendan; Hegland, Markus; Larson, Jay
  • SIAM Journal on Scientific Computing, Vol. 37, Issue 3
  • DOI: 10.1137/140964448

Self-stabilizing iterative solvers
conference, January 2013

  • Sao, Piyush; Vuduc, Richard
  • Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13
  • DOI: 10.1145/2530268.2530272

Mini-Ckpts: Surviving OS Failures in Persistent Memory
conference, June 2016

  • Fiala, David; Mueller, Frank; Ferreira, Kurt
  • ICS '16: 2016 International Conference on Supercomputing, Proceedings of the 2016 International Conference on Supercomputing
  • DOI: 10.1145/2925426.2926295

Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids
journal, January 2016

  • Schornbaum, Florian; Rüde, Ulrich
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 2
  • DOI: 10.1137/15M1035240

Algorithm-Based Fault Tolerance for Convolutional Neural Networks
journal, January 2021


A Posteriori Error Estimates Based on Hierarchical Bases
journal, August 1993

  • Bank, Randolph E.; Smith, R. Kent
  • SIAM Journal on Numerical Analysis, Vol. 30, Issue 4
  • DOI: 10.1137/0730048

Performance of asynchronous optimized Schwarz with one-sided communication
journal, August 2019


CHARM++: a portable concurrent object oriented system based on C++
conference, January 1993

  • Kale, Laxmikant V.; Krishnan, Sanjeev
  • Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications - OOPSLA '93
  • DOI: 10.1145/165854.165874

Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
conference, January 2014

  • Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642774

A Class of Multirate Infinitesimal GARK Methods
journal, January 2019

  • Sandu, Adrian
  • SIAM Journal on Numerical Analysis, Vol. 57, Issue 5
  • DOI: 10.1137/18M1205492

Fast Error-Bounded Lossy HPC Data Compression with SZ
conference, May 2016

  • Di, Sheng; Cappello, Franck
  • 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2016.11

Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals
journal, January 2000


Dimension-Adaptive Tensor-Product Quadrature
journal, August 2003


FlipSphere: A Software-Based DRAM Error Detection and Correction Library for HPC
conference, September 2016

  • Fiala, David; Mueller, Frank; Ferreira, Kurt B.
  • 2016 IEEE/ACM 20th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
  • DOI: 10.1109/DS-RT.2016.27

A semi-implicit, semi-Lagrangian discontinuous Galerkin framework for adaptive numerical weather prediction
journal, May 2015

  • Tumolo, Giovanni; Bonaventura, Luca
  • Quarterly Journal of the Royal Meteorological Society, Vol. 141, Issue 692
  • DOI: 10.1002/qj.2544

Recent Advances and New Avenues in Hardware-Level Reliability Support
journal, November 2005

  • Iyer, R. K.; Nakka, N. M.; Kalbarczyk, Z. T.
  • IEEE Micro, Vol. 25, Issue 6
  • DOI: 10.1109/MM.2005.119

Large-scale simulation of mantle convection based on a new matrix-free approach
journal, February 2019


Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers
conference, January 2005

  • Gioiosa, R.; Sancho, J. C.; Jiang, S.
  • ACM/IEEE SC 2005 Conference (SC'05)
  • DOI: 10.1109/SC.2005.76

Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984

  • Huang, Kuang-Hua; Abraham, Jacob A.
  • IEEE Transactions on Computers, Vol. C-33, Issue 6
  • DOI: 10.1109/TC.1984.1676475

OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES
journal, June 2011

  • Duran, Alejandro; Ayguadé, Eduard; Badia, Rosa M.
  • Parallel Processing Letters, Vol. 21, Issue 02
  • DOI: 10.1142/S0129626411000151

Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra
conference, May 2016

  • Wu, Panruo; Guan, Qiang; DeBardeleben, Nathan
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
  • DOI: 10.1145/2907294.2907315

SWIFT: Software Implemented Fault Tolerance
conference, January 2005

  • Reis, G. A.; Chang, J.; Vachharajani, N.
  • International Symposium on Code Generation and Optimization
  • DOI: 10.1109/CGO.2005.34

Interpolation-Restart Strategies for Resilient Eigensolvers
journal, January 2016

  • Agullo, E.; Giraud, L.; Salas, P.
  • SIAM Journal on Scientific Computing, Vol. 38, Issue 5
  • DOI: 10.1137/15M1042115

Parallel asynchronous algorithms: A survey
journal, November 2020


Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
conference, May 2017

  • Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad
  • 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
  • DOI: 10.1109/CCGRID.2017.40

Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors
conference, May 2016

  • Benoit, Anne; Cavelan, Aurelien; Robert, Yves
  • 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2016.39

A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers
journal, July 2018


Towards Textbook Efficiency for Parallel Multigrid
journal, February 2015

  • Gmeiner, Björn; Rüde, Ulrich; Stengel, Holger
  • Numerical Mathematics: Theory, Methods and Applications, Vol. 8, Issue 1
  • DOI: 10.4208/nmtma.2015.w10si

Symmetric active/active metadata service for high availability parallel file systems
journal, December 2009

  • He, Xubin; Ou, Li; Engelmann, Christian
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 12
  • DOI: 10.1016/j.jpdc.2009.08.004

Verificarlo: Checking Floating Point Accuracy through Monte Carlo Arithmetic
conference, July 2016

  • Denis, Christophe; De Oliveira Castro, Pablo; Petit, Eric
  • 2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH)
  • DOI: 10.1109/ARITH.2016.31

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
conference, January 2016

  • Tao, Dingwen; Song, Shuaiwen Leon; Krishnamoorthy, Sriram
  • Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16
  • DOI: 10.1145/2907294.2907306

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
journal, November 2015


Parallel adaptive FETI‐DP using lightweight asynchronous dynamic load balancing
journal, October 2019

  • Klawonn, Axel; Kühn, Martin J.; Rheinbach, Oliver
  • International Journal for Numerical Methods in Engineering, Vol. 121, Issue 4
  • DOI: 10.1002/nme.6237

Distributed asynchronous computation of fixed points
journal, September 1983

  • Bertsekas, Dimitri P.
  • Mathematical Programming, Vol. 27, Issue 1
  • DOI: 10.1007/bf02591967

Multivariate Quadrature on Adaptive Sparse Grids
journal, August 2003


Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability
journal, January 2012


The Lanczos and conjugate gradient algorithms in finite precision arithmetic
journal, May 2006


Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006


Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
conference, June 2014

  • Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • DOI: 10.1109/dsn.2014.78

Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes
conference, November 2019

  • Pachajoa, Carlos; Pacher, Christina; Gansterer, Wilfried N.
  • 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
  • DOI: 10.1109/ftxs49593.2019.00009

Evaluating the Impact of SDC on the GMRES Iterative Solver
conference, May 2014

  • Elliott, James; Hoemmen, Mark; Mueller, Frank
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/ipdps.2014.123

Dynamic load balancing and efficient load estimators for asynchronous iterative algorithms
journal, April 2005

  • Bahi, J. M.; Contassot-Vivier, S.; Couturier, R.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 16, Issue 4
  • DOI: 10.1109/tpds.2005.45

On Soft Errors in the Conjugate Gradient Method: Sensitivity and Robust Numerical Detection
journal, January 2020

  • Agullo, Emmanuel; Cools, Siegfried; Yetkin, Emrullah Fatih
  • SIAM Journal on Scientific Computing, Vol. 42, Issue 6
  • DOI: 10.1137/18m122858x

Fault Tolerance in the Parareal Method
conference, May 2016

  • Nielsen, Allan S.; Hesthaven, Jan S.
  • HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
  • DOI: 10.1145/2909428.2909431

PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures
conference, January 2017

  • Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
  • DOI: 10.1145/3126908.3126943

A dimension adaptive sparse grid combination technique for machine learning
journal, April 2007


A Pattern Language for High-Performance Computing Resilience
text, January 2017


Algorithm-Based Fault Tolerance for Parallel Stencil Computations
preprint, January 2019