Resiliency in numerical algorithm design for extreme scale simulations

Agullo, Emmanuel; Altenbernd, Mirco; Anzt, Hartwig; Bautista-Gomez, Leonardo; Benacchio, Tommaso; Bonaventura, Luca; Bungartz, Hans-Joachim; Chatterjee, Sanjay; Ciorba, Florina M.; DeBardeleben, Nathan; Drzisga, Daniel; Eibl, Sebastian; Engelmann, Christian; Gansterer, Wilfried N.; Giraud, Luc; Göddeke, Dominik; Heisig, Marco; Jézéquel, Fabienne; Kohl, Nils; Li, Xiaoye Sherry; Lion, Romain; Mehl, Miriam; Mycek, Paul; Obersteiner, Michael; Quintana-Ortí, Enrique S.; Rizzi, Francesco; Rüde, Ulrich; Schulz, Martin; Fung, Fred; Speck, Robert; Stals, Linda; Teranishi, Keita; Thibault, Samuel; Thönnes, Dominik; Wagner, Andreas; Wohlmuth, Barbara

doi:10.1177/10943420211055188

Journal Article · December 10, 2021 · International Journal of High Performance Computing Applications

DOI: https://doi.org/10.1177/10943420211055188 · OSTI ID: 1855669

Agullo, Emmanuel ^[1]; Altenbernd, Mirco ^[2]; Anzt, Hartwig ^[3]; Bautista-Gomez, Leonardo ^[4]; Benacchio, Tommaso ^[5]; Bonaventura, Luca ^[5]; Bungartz, Hans-Joachim ^[6]; Chatterjee, Sanjay ^[7]; Ciorba, Florina M. ^[8]; DeBardeleben, Nathan ^[9]; Drzisga, Daniel ^[6]; Eibl, Sebastian ^[10]; Engelmann, Christian ^[11]; Gansterer, Wilfried N. ^[12]; Giraud, Luc ^[1]; Göddeke, Dominik ^[2]; Heisig, Marco ^[10]; Jézéquel, Fabienne ^[13]; Kohl, Nils ^[10]; Li, Xiaoye Sherry ^[14]; et al.

[1] National Institute for Research in Digital Science and Technology (Inria), Rocquencourt (France)
[2] Univ. of Stuttgart (Germany)
[3] Karlsruhe Institute of Technology (Germany)
[4] Barcelona Supercomputing Center (Spain)
[5] Polytechnic Univ. of Milan (Italy)
[6] Technical Univ. of Munich (Germany)
[7] NVIDIA Corporation, Santa Clara, CA (United States)
[8] Univ. of Basel (Switzerland)
[9] Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
[10] Univ. of Erlangen, Nuremberg (Germany)
[11] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[12] Univ. of Vienna (Austria)
[13] Paris-Pantheon-Assas Univ., Paris (France)
[14] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[15] Univ. of Bordeaux (France)
[16] Cerfacs, Toulouse (France)
[17] Polytechnic Univ. of Valencia (UPV) (Spain)
[18] NexGen Analytics, Sheridan, WY (United States)
[19] Univ. of Erlangen, Nuremberg (Germany); Cerfacs, Toulouse (France)
[20] Australian National Univ., Canberra, ACT (Australia)
[21] Forschungszentrum Jülich GmbH (Germany)
[22] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds that come at the price of enormous resource consumption and cost. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10²³ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
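
The back-of-the-envelope figures in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the article: the electricity price (0.10 Euro per kWh), the sustained rate of 10¹⁸ floating-point operations per second, the 10 PB footprint, the 1 TB/s storage bandwidth, and the one-hour mean time between failures are all assumed values chosen to show the orders of magnitude, and every function name is hypothetical.

```python
# Rough check of the abstract's orders of magnitude.
# All prices, rates, and bandwidths below are ASSUMPTIONS for illustration.

def energy_kwh(power_mw: float, hours: float) -> float:
    """Energy in kWh for a system drawing power_mw megawatts for the given hours."""
    return power_mw * 1000.0 * hours

def energy_cost_eur(kwh: float, price_per_kwh: float) -> float:
    """Energy cost in Euro at an assumed flat price per kWh."""
    return kwh * price_per_kwh

def total_flops(rate_flops_per_s: float, hours: float) -> float:
    """Floating-point operations executed at a sustained rate over the given hours."""
    return rate_flops_per_s * hours * 3600.0

def checkpoint_write_seconds(footprint_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to write the full memory footprint synchronously to storage."""
    return footprint_bytes / bandwidth_bytes_per_s

def can_make_progress(mtbf_s: float, recover_s: float) -> bool:
    """Crude criterion: progress is only possible if failures arrive, on average,
    less often than a recovery takes to complete."""
    return mtbf_s > recover_s

if __name__ == "__main__":
    kwh = energy_kwh(power_mw=20.0, hours=48.0)
    print(f"energy: {kwh:.0f} kWh")
    print(f"cost at assumed 0.10 EUR/kWh: {energy_cost_eur(kwh, 0.10):.0f} EUR")
    print(f"operations at assumed 1e18 flop/s: {total_flops(1e18, 48.0):.3e}")
    # Assumed 10 PB footprint and 1 TB/s aggregate bandwidth to storage.
    t_ckpt = checkpoint_write_seconds(10e15, 1e12)
    print(f"checkpoint write time: {t_ckpt:.0f} s")
    # Assumed mean time between failures of one hour.
    print(f"progress possible: {can_make_progress(3600.0, t_ckpt)}")
```

Under these assumed values the run consumes just under a million kWh, and a full synchronous checkpoint takes longer than the assumed mean time between failures, which is the situation the abstract warns about.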

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1855669

Journal Information: International Journal of High Performance Computing Applications, Vol. 36, Issue 2; ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Recent Advances and New Avenues in Hardware-Level Reliability Support Iyer, R. K.; Nakka, N. M.; Kalbarczyk, Z. T. IEEE Micro, Vol. 25, Issue 6 https://doi.org/10.1109/MM.2005.119	journal	November 2005
RAJA: Portable Performance for Large-Scale Scientific Applications Beckingsale, David A.; Scogland, Thomas RW; Burmark, Jason 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC) https://doi.org/10.1109/P3HPC49587.2019.00012	conference	November 2019
Proactive Fault Tolerance Using Preemptive Migration Engelmann, Christian; Vallee, Geoffroy R.; Naughton, Thomas 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing https://doi.org/10.1109/PDP.2009.31	conference	February 2009
NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart Subasi, Omer; Arias, Javier; Unsal, Osman 2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing https://doi.org/10.1109/PDP.2015.17	conference	March 2015
Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers Gioiosa, R.; Sancho, J. C.; Jiang, S. ACM/IEEE SC 2005 Conference (SC'05) https://doi.org/10.1109/SC.2005.76	conference	January 2005
Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI Coti, Camille; Herault, Thomas; Lemarinier, Pierre SC 2006 Proceedings Supercomputing 2006, ACM/IEEE SC 2006 Conference (SC'06) https://doi.org/10.1109/SC.2006.15	conference	November 2006
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn 2010 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2010.18	conference	November 2010
Design and modeling of a non-blocking checkpointing system Sato, Kento; Maruyama, Naoya; Mohror, Kathryn 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2012.46	conference	November 2012
Detection and correction of silent data corruption for large-scale high-performance computing Fiala, David; Mueller, Frank; Engelmann, Christian 2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2012.49	conference	November 2012
Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth SC14: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2014.78	conference	November 2014
FlipBack: Automatic Targeted Protection against Silent Data Corruption Ni, Xiang; Kale, Laxmikant V. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2016.28	conference	November 2016
Programming Models and Development Software for a Space-Based Many-Core Processor Crago, Stephen P.; Kang, Dong-In; Kang, Mikyung 2011 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology https://doi.org/10.1109/SMC-IT.2011.29	conference	August 2011
Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors Scholzel, Mario 2007 Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA 2007), Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007 https://doi.org/10.1109/SPA.2007.5903294	conference	September 2007
Performance Scaling Variability and Energy Analysis for a Resilient ULFM-based PDE Solver Morris, K.; Rizzi, F.; Cook, B. 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) https://doi.org/10.1109/ScalA.2016.010	conference	November 2016
Algorithm-Based Fault Tolerance for Matrix Operations Huang, Kuang-Hua; Abraham, Jacob A. IEEE Transactions on Computers, Vol. C-33, Issue 6 https://doi.org/10.1109/TC.1984.1676475	journal	June 1984
An algorithm-based error detection scheme for the multigrid method Mishra, A.; Banerjee, P. IEEE Transactions on Computers, Vol. 52, Issue 9 https://doi.org/10.1109/TC.2003.1228507	journal	September 2003
Basic concepts and taxonomy of dependable and secure computing Avizienis, A.; Laprie, J.-C.; Randell, B. IEEE Transactions on Dependable and Secure Computing, Vol. 1, Issue 1 https://doi.org/10.1109/TDSC.2004.2	journal	January 2004
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications Di, Sheng; Cappello, Franck IEEE Transactions on Parallel and Distributed Systems, Vol. 27, Issue 10 https://doi.org/10.1109/TPDS.2016.2517639	journal	October 2016
Toward General Software Level Silent Data Corruption Detection for Parallel Applications Berrocal, Eduardo; Bautista-Gomez, Leonardo; Di, Sheng IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 12 https://doi.org/10.1109/TPDS.2017.2735971	journal	December 2017
An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL Tang, Xiongchao; Zhai, Jidong; Yu, Bowen IEEE Transactions on Parallel and Distributed Systems, Vol. 29, Issue 4 https://doi.org/10.1109/TPDS.2017.2781257	journal	April 2018
CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance Shahzad, Faisal; Thies, Jonas; Kreutzer, Moritz IEEE Transactions on Parallel and Distributed Systems, Vol. 30, Issue 3 https://doi.org/10.1109/TPDS.2018.2866794	journal	March 2019
Algorithm-Based Fault Tolerance for Convolutional Neural Networks Zhao, Kai; Di, Sheng; Li, Sihuan IEEE Transactions on Parallel and Distributed Systems https://doi.org/10.1109/tpds.2020.3043449	journal	January 2021
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment Langou, J.; Chen, Z.; Bosilca, G. SIAM Journal on Scientific Computing, Vol. 30, Issue 1 https://doi.org/10.1137/040620394	journal	January 2008
Fully Adaptive Multigrid Methods Rüde, Ulrich SIAM Journal on Numerical Analysis, Vol. 30, Issue 1 https://doi.org/10.1137/0730011	journal	February 1993
A Posteriori Error Estimates Based on Hierarchical Bases Bank, Randolph E.; Smith, R. Kent SIAM Journal on Numerical Analysis, Vol. 30, Issue 4 https://doi.org/10.1137/0730048	journal	August 1993
Density Estimation with Adaptive Sparse Grids for Large Data Sets Peherstorfer, Benjamin; Pflüger, Dirk; Bungartz, Hans-Joachim Proceedings of the 2014 SIAM International Conference on Data Mining https://doi.org/10.1137/1.9781611973440.51	conference	April 2014
Fault Tolerant Computation with the Sparse Grid Combination Technique Harding, Brendan; Hegland, Markus; Larson, Jay SIAM Journal on Scientific Computing, Vol. 37, Issue 3 https://doi.org/10.1137/140964448	journal	January 2015
Fine-Grained Parallel Incomplete LU Factorization Chow, Edmond; Patel, Aftab SIAM Journal on Scientific Computing, Vol. 37, Issue 2 https://doi.org/10.1137/140968896	journal	January 2015
Resilience for Massively Parallel Multigrid Solvers Huber, Markus; Gmeiner, Björn; Rüde, Ulrich SIAM Journal on Scientific Computing, Vol. 38, Issue 5 https://doi.org/10.1137/15M1026122	journal	January 2016
Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids Schornbaum, Florian; Rüde, Ulrich SIAM Journal on Scientific Computing, Vol. 38, Issue 2 https://doi.org/10.1137/15M1035240	journal	January 2016
Interpolation-Restart Strategies for Resilient Eigensolvers Agullo, E.; Giraud, L.; Salas, P. SIAM Journal on Scientific Computing, Vol. 38, Issue 5 https://doi.org/10.1137/15M1042115	journal	January 2016
Discrete A Priori Bounds for the Detection of Corrupted PDE Solutions in Exascale Computations Mycek, Paul; Rizzi, Francesco; Maître, Olivier Le SIAM Journal on Scientific Computing, Vol. 39, Issue 1 https://doi.org/10.1137/15M1051786	journal	January 2017
Is the Multigrid Method Fault Tolerant? The Multilevel Case Ainsworth, Mark; Glusa, Christian SIAM Journal on Scientific Computing, Vol. 39, Issue 6 https://doi.org/10.1137/16M1097274	journal	January 2017
On the Analysis of Block Smoothers for Saddle Point Problems Drzisga, Daniel; John, Lorenz; Rüde, Ulrich SIAM Journal on Matrix Analysis and Applications, Vol. 39, Issue 2 https://doi.org/10.1137/16M1106304	journal	January 2018
Extreme-Scale Block-Structured Adaptive Mesh Refinement Schornbaum, Florian; Rüde, Ulrich SIAM Journal on Scientific Computing, Vol. 40, Issue 3 https://doi.org/10.1137/17M1128411	journal	January 2018
A Stencil Scaling Approach for Accelerating Matrix-Free Finite Element Implementations Bauer, S.; Drzisga, D.; Mohr, M. SIAM Journal on Scientific Computing, Vol. 40, Issue 6 https://doi.org/10.1137/17M1148384	journal	January 2018
A Class of Multirate Infinitesimal GARK Methods Sandu, Adrian SIAM Journal on Numerical Analysis, Vol. 57, Issue 5 https://doi.org/10.1137/18M1205492	journal	January 2019
Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals van der Vorst, Henk A.; Ye, Qiang SIAM Journal on Scientific Computing, Vol. 22, Issue 3 https://doi.org/10.1137/S1064827599353865	journal	January 2000
OmpSs: A Proposal for Programming Heterogeneous Multi-Core Architectures Duran, Alejandro; Ayguadé, Eduard; Badia, Rosa M. Parallel Processing Letters, Vol. 21, Issue 02 https://doi.org/10.1142/S0129626411000151	journal	June 2011
Scalable, fault tolerant membership for MPI tasks on HPC systems Varma, Jyothish; Wang, Chao; Mueller, Frank Proceedings of the 20th annual international conference on Supercomputing - ICS '06 https://doi.org/10.1145/1183401.1183433	conference	January 2006
Proactive fault tolerance for HPC with Xen virtualization Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian Proceedings of the 21st annual international conference on Supercomputing - ICS '07 https://doi.org/10.1145/1274971.1274978	conference	January 2007
CHARM++: a portable concurrent object oriented system based on C++ Kale, Laxmikant V.; Krishnan, Sanjeev Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications - OOPSLA '93 https://doi.org/10.1145/165854.165874	conference	January 1993
Characterizing the impact of soft errors on iterative methods in scientific computing Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma Proceedings of the international conference on Supercomputing - ICS '11 https://doi.org/10.1145/1995896.1995922	conference	January 2011
Algorithms and data structures for massively parallel generic adaptive finite element codes Bangerth, Wolfgang; Burstedde, Carsten; Heister, Timo ACM Transactions on Mathematical Software, Vol. 38, Issue 2 https://doi.org/10.1145/2049673.2049678	journal	December 2011
FTI: high performance fault tolerance interface for hybrid systems Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11 https://doi.org/10.1145/2063384.2063427	conference	January 2011
Robust distributed orthogonalization based on randomized aggregation Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11 https://doi.org/10.1145/2133173.2133177	conference	January 2011
Algorithm-based fault tolerance for dense matrix factorizations Du, Peng; Bouteiller, Aurelien; Bosilca, George ACM SIGPLAN Notices, Vol. 47, Issue 8 https://doi.org/10.1145/2370036.2145845	journal	September 2012
Parallel reduction to hessenberg form with algorithm-based fault tolerance Jia, Yulu; Bosilca, George; Luszczek, Piotr SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2503210.2503249	conference	November 2013
SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing Ropars, Thomas; Martsinkevich, Tatiana V.; Guermouche, Amina SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2503210.2503271	conference	November 2013
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods Chen, Zizhong ACM SIGPLAN Notices, Vol. 48, Issue 8 https://doi.org/10.1145/2517327.2442533	journal	August 2013
Self-stabilizing iterative solvers Sao, Piyush; Vuduc, Richard Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13 https://doi.org/10.1145/2530268.2530272	conference	January 2013
FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines Wu, Panruo; Chen, Zizhong Proceedings of the 23rd international symposium on High-performance parallel and distributed computing - HPDC '14 https://doi.org/10.1145/2600212.2600232	conference	January 2014
Toward Local Failure Local Recovery Resilience Model using MPI-ULFM Teranishi, Keita; Heroux, Michael A. Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14 https://doi.org/10.1145/2642769.2642774	conference	January 2014
MCALIB: Measuring Sensitivity to Rounding Error with Monte Carlo Programming Frechtling, Michael; Leong, Philip H. W. ACM Transactions on Programming Languages and Systems, Vol. 37, Issue 2 https://doi.org/10.1145/2665073	journal	April 2015
HPX: A Task Based Programming Model in a Global Address Space Kaiser, Hartmut; Heller, Thomas; Adelstein-Lelbach, Bryce Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14 https://doi.org/10.1145/2676870.2676883	conference	January 2014
Programmer-directed partial redundancy for resilient HPC Subasi, Omer; Arias, Javier; Unsal, Osman CF'15: Computing Frontiers Conference, Proceedings of the 12th ACM International Conference on Computing Frontiers https://doi.org/10.1145/2742854.2742903	conference	May 2015
Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices Austin, Brian; Roman, Eric; Li, Xiaoye HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale https://doi.org/10.1145/2751504.2751507	conference	June 2015
Exploiting asynchrony from exact forward recovery for DUE in iterative solvers Jaulmes, Luc; Casas, Marc; Moretó, Miquel SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2807591.2807599	conference	November 2015
VOCL-FT: introducing techniques for efficient soft error coprocessor recovery Peña, Antonio J.; Bland, Wesley; Balaji, Pavan SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2807591.2807640	conference	November 2015
Tuning stationary iterative solvers for fault resilience Anzt, Hartwig; Dongarra, Jack; Quintana-Ortí, Enrique S. Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '15 https://doi.org/10.1145/2832080.2832081	conference	January 2015
New-Sum: A Novel Online ABFT Scheme For General Iterative Methods Tao, Dingwen; Song, Shuaiwen Leon; Krishnamoorthy, Sriram Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16 https://doi.org/10.1145/2907294.2907306	conference	January 2016
Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra Wu, Panruo; Guan, Qiang; DeBardeleben, Nathan HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing https://doi.org/10.1145/2907294.2907315	conference	May 2016
ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner Rizzi, Francesco; Morris, Karla; Sargsyan, Khachik HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale https://doi.org/10.1145/2909428.2909429	conference	May 2016
Mini-Ckpts: Surviving OS Failures in Persistent Memory Fiala, David; Mueller, Frank; Ferreira, Kurt ICS '16: 2016 International Conference on Supercomputing, Proceedings of the 2016 International Conference on Supercomputing https://doi.org/10.1145/2925426.2926295	conference	June 2016
Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale Benoit, Anne; Cavelan, Aurélien; Cappello, Franck Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale - FTXS '17 https://doi.org/10.1145/3086157.3086162	conference	January 2017
Correcting soft errors online in fast fourier transform Liang, Xin; Chen, Zizhong; Chen, Jieyang Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17 https://doi.org/10.1145/3126908.3126915	conference	January 2017
REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed Georgakoudis, Giorgis; Laguna, Ignacio; Nikolopoulos, Dimitrios S. SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/3126908.3126972	conference	November 2017
A Pattern Language for High-Performance Computing Resilience Hukerikar, Saurabh; Engelmann, Christian EuroPLoP '17: European Conference on Pattern Languages of Programs, Proceedings of the 22nd European Conference on Pattern Languages of Programs https://doi.org/10.1145/3147704.3147718	conference	July 2017
A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations Obersteiner, Michael; Hinojosa, Alfredo Parra; Heene, Mario SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems https://doi.org/10.1145/3148226.3148229	conference	November 2017
Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing Ashraf, Rizwan A.; Hukerikar, Saurabh; Engelmann, Christian ICPE '18: ACM/SPEC International Conference on Performance Engineering, Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering https://doi.org/10.1145/3184407.3184421	conference	March 2018
DisCVar: discovering critical variables using algorithmic differentiation for transient faults Menon, Harshitha; Mohror, Kathryn ACM SIGPLAN Notices, Vol. 53, Issue 1 https://doi.org/10.1145/3200691.3178502	journal	March 2018
Improving performance of iterative methods by lossy checkponting Tao, Dingwen; Di, Sheng; Liang, Xin Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18 https://doi.org/10.1145/3208040.3208050	conference	January 2018
Asynchronous Iterative Methods for Multiprocessors Baudet, Gérard M. Journal of the ACM, Vol. 25, Issue 2 https://doi.org/10.1145/322063.322067	journal	April 1978
Evaluating Support for OpenMP Offload Features Diaz, Jose Monsalve; Pophale, Swaroop; Friedline, Kyle Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18 https://doi.org/10.1145/3229710.3229717	conference	January 2018
Towards resilient EU HPC systems: a blueprint Radojkovic, Petar CF '19: Computing Frontiers Conference, Proceedings of the 16th ACM International Conference on Computing Frontiers https://doi.org/10.1145/3310273.3323434	conference	April 2019
How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures Pachajoa, Carlos; Levonyak, Markus; Gansterer, Wilfried N. ICPP 2019: 48th International Conference on Parallel Processing, Proceedings of the 48th International Conference on Parallel Processing https://doi.org/10.1145/3337821.3337849	conference	August 2019
Self-stabilizing systems in spite of distributed control Dijkstra, Edsger W. Communications of the ACM, Vol. 17, Issue 11 https://doi.org/10.1145/361179.361202	journal	November 1974
A survey of rollback-recovery protocols in message-passing systems Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min ACM Computing Surveys, Vol. 34, Issue 3 https://doi.org/10.1145/568522.568525	journal	September 2002
Toward Exascale Resilience Cappello, Franck; Geist, Al; Gropp, Bill The International Journal of High Performance Computing Applications, Vol. 23, Issue 4 https://doi.org/10.1177/1094342009347767	journal	September 2009
Silent error detection in numerical time-stepping schemes Benson, Austin R.; Schmit, Sven; Schreiber, Robert The International Journal of High Performance Computing Applications, Vol. 29, Issue 4 https://doi.org/10.1177/1094342014532297	journal	April 2014
Evaluating and extending user-level fault tolerance in MPI applications Laguna, Ignacio; Richards, David F.; Gamblin, Todd The International Journal of High Performance Computing Applications, Vol. 30, Issue 3 https://doi.org/10.1177/1094342015623623	journal	July 2016
Complex scientific applications made fault-tolerant with the sparse grid combination technique Ali, Md Mohsin; Strazdins, Peter E.; Harding, Brendan The International Journal of High Performance Computing Applications, Vol. 30, Issue 3 https://doi.org/10.1177/1094342015628056	journal	July 2016
Exploring versioned distributed arrays for resilience in scientific applications: global view resilience Chien, A.; Balaji, P.; Dun, N. The International Journal of High Performance Computing Applications, Vol. 31, Issue 6 https://doi.org/10.1177/1094342016664796	journal	September 2016
Unified fault-tolerance framework for hybrid task-parallel message-passing applications Subasi, Omer; Martsinkevich, Tatiana; Zyulkyarov, Ferad The International Journal of High Performance Computing Applications, Vol. 32, Issue 5 https://doi.org/10.1177/1094342016669416	journal	September 2016
Soft fault detection and correction for multigrid Altenbernd, Mirco; Göddeke, Dominik The International Journal of High Performance Computing Applications, Vol. 32, Issue 6 https://doi.org/10.1177/1094342016684006	journal	February 2017
Algorithm-based fault recovery of adaptively refined parallel multilevel grids Stals, Linda The International Journal of High Performance Computing Applications, Vol. 33, Issue 1 https://doi.org/10.1177/1094342017720801	journal	August 2017
Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions Casas, Marc; Gansterer, Wilfried N.; Wimmer, Elias The International Journal of High Performance Computing Applications, Vol. 33, Issue 2 https://doi.org/10.1177/1094342018762531	journal	April 2018
A scalable and extensible checkpointing scheme for massively parallel simulations Kohl, Nils; Hötzer, Johannes; Schornbaum, Florian The International Journal of High Performance Computing Applications, Vol. 33, Issue 4 https://doi.org/10.1177/1094342018767736	journal	May 2018
Adaptive control in roll-forward recovery for extreme scale multigrid Huber, Markus; Rüde, Ulrich; Wohlmuth, Barbara The International Journal of High Performance Computing Applications, Vol. 33, Issue 5 https://doi.org/10.1177/1094342018817088	journal	December 2018
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco The International Journal of High Performance Computing Applications, Vol. 35, Issue 4 https://doi.org/10.1177/1094342021990433	journal	February 2021
Achieving algorithmic resilience for temporal integration through spectral deferred corrections Grout, Ray; Kolla, Hemanth; Minion, Michael Communications in Applied Mathematics and Computational Science, Vol. 12, Issue 1 https://doi.org/10.2140/camcos.2017.12.25	journal	January 2017
Towards Textbook Efficiency for Parallel Multigrid Gmeiner, Björn; Rüde, Ulrich; Stengel, Holger Numerical Mathematics: Theory, Methods and Applications, Vol. 8, Issue 1 https://doi.org/10.4208/nmtma.2015.w10si	journal	February 2015
Methods of conjugate gradients for solving linear systems Hestenes, M. R.; Stiefel, E. Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6 https://doi.org/10.6028/jres.049.044	journal	December 1952

Similar Records

Design for a Soft Error Resilient Dynamic Task-Based Runtime, In: 2015 IEEE International Parallel and Distributed Processing Symposium

Conference · Fri May 01 00:00:00 EDT 2015 · 2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS) · OSTI ID:1567397

Report on the Dagstuhl Seminar on Visualization and Monitoring of Network Traffic

Journal Article · Tue Jan 25 23:00:00 EST 2011 · Journal of Network and Systems Management, 18(2):232-236 · OSTI ID:1007354

Holistic Measurement Driven Resilience: Combining Operational Fault and Failure Measurements and Fault Injection for Quantifying Fault Detection, Propagation and Impact. Final report

Technical Report · Thu Apr 16 00:00:00 EDT 2020 · OSTI ID:1615150

Related Subjects

79 ASTRONOMY AND ASTROPHYSICS
fault tolerance
numerical algorithms
parallel computer architecture
resilience
