Resiliency in numerical algorithm design for extreme scale simulations

Agullo, Emmanuel; Altenbernd, Mirco; Anzt, Hartwig; Bautista-Gomez, Leonardo; Benacchio, Tommaso; Bonaventura, Luca; Bungartz, Hans-Joachim; Chatterjee, Sanjay; Ciorba, Florina M.; DeBardeleben, Nathan; Drzisga, Daniel; Eibl, Sebastian; Engelmann, Christian; Gansterer, Wilfried N.; Giraud, Luc; Göddeke, Dominik; Heisig, Marco; Jézéquel, Fabienne; Kohl, Nils; Li, Xiaoye Sherry; Lion, Romain; Mehl, Miriam; Mycek, Paul; Obersteiner, Michael; Quintana-Ortí, Enrique S.; Rizzi, Francesco; Rüde, Ulrich; Schulz, Martin; Fung, Fred; Speck, Robert; Stals, Linda; Teranishi, Keita; Thibault, Samuel; Thönnes, Dominik; Wagner, Andreas; Wohlmuth, Barbara

doi:10.1177/10943420211055188

Abstract

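This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10²³ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?

While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.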
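As a rough sanity check of the energy figures quoted in the abstract, the short sketch below recomputes them from the stated run length and power draw. The implied electricity price and the sustained floating-point rate are not given in the source; they are derived here purely for illustration, and all variable names are hypothetical.

```python
# Back-of-the-envelope check of the abstract's energy figures.
RUNTIME_H = 48              # run length in hours (from the abstract)
POWER_MW = 20               # system power draw in megawatts (from the abstract)
TOTAL_FLOP = 1e23           # floating-point operations executed (from the abstract)
QUOTED_COST_EUR = 100_000   # quoted energy cost, "about 100k Euro" (from the abstract)

# Energy = power x time; 1 MW = 1,000 kW.
energy_kwh = RUNTIME_H * POWER_MW * 1_000

# Derived quantities (assumptions for illustration, not stated in the source):
implied_price_eur_per_kwh = QUOTED_COST_EUR / energy_kwh
sustained_rate_flops = TOTAL_FLOP / (RUNTIME_H * 3600)

# 960,000 kWh, i.e. roughly the "million kWh" quoted in the abstract.
print(f"energy consumed : {energy_kwh:,.0f} kWh")
print(f"implied price   : {implied_price_eur_per_kwh:.3f} EUR/kWh")
print(f"sustained rate  : {sustained_rate_flops:.2e} flop/s")
```

Under these assumptions the quoted cost corresponds to an electricity price of roughly 0.10 EUR per kWh, and the operation count to a sustained rate somewhat above half an exaflop per second over the 48 h run.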

Authors:

Agullo, Emmanuel ^[1]; Altenbernd, Mirco ^[2]; Anzt, Hartwig ^[3]; Bautista-Gomez, Leonardo ^[4]; Benacchio, Tommaso ^[5]; Bonaventura, Luca ^[5]; Bungartz, Hans-Joachim ^[6]; Chatterjee, Sanjay ^[7]; Ciorba, Florina M. ^[8]; DeBardeleben, Nathan ^[9]; Drzisga, Daniel ^[6]; Eibl, Sebastian ^[10]; Engelmann, Christian ^[11]; Gansterer, Wilfried N. ^[12]; Giraud, Luc ^[1]; Göddeke, Dominik ^[2]; Heisig, Marco ^[10]; Jézéquel, Fabienne ^[13]; Kohl, Nils ^[10]; Li, Xiaoye Sherry ^[14]; et al.

[1] National Institute for Research in Digital Science and Technology (Inria), Rocquencourt (France)
[2] Univ. of Stuttgart (Germany)
[3] Karlsruhe Institute of Technology (Germany)
[4] Barcelona Supercomputing Center (Spain)
[5] Polytechnic Univ. of Milan (Italy)
[6] Technical Univ. of Munich (Germany)
[7] NVIDIA Corporation, Santa Clara, CA (United States)
[8] Univ. of Basel (Switzerland)
[9] Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
[10] Univ. of Erlangen, Nuremberg (Germany)
[11] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[12] Univ. of Vienna (Austria)
[13] Paris-Pantheon-Assas Univ., Paris (France)
[14] Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
[15] Univ. of Bordeaux (France)
[16] Cerfacs, Toulouse (France)
[17] Polytechnic Univ. of Valencia (UPV) (Spain)
[18] NexGen Analytics, Sheridan, WY (United States)
[19] Univ. of Erlangen, Nuremberg (Germany); Cerfacs, Toulouse (France)
[20] Australian National Univ., Canberra, ACT (Australia)
[21] Forschungszentrum Jülich GmbH (Germany)
[22] Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Publication Date: 2021-12-10

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Org.: USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)

OSTI Identifier: 1855669

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: International Journal of High Performance Computing Applications

Additional Journal Information: Journal Volume: 36; Journal Issue: 2; Journal ID: ISSN 1094-3420

Publisher: SAGE

Country of Publication: United States

Language: English

Subject: 79 ASTRONOMY AND ASTROPHYSICS; numerical algorithms; parallel computer architecture; fault tolerance; resilience

Citation Formats


Agullo, Emmanuel, Altenbernd, Mirco, Anzt, Hartwig, Bautista-Gomez, Leonardo, Benacchio, Tommaso, Bonaventura, Luca, Bungartz, Hans-Joachim, Chatterjee, Sanjay, Ciorba, Florina M., DeBardeleben, Nathan, Drzisga, Daniel, Eibl, Sebastian, Engelmann, Christian, Gansterer, Wilfried N., Giraud, Luc, Göddeke, Dominik, Heisig, Marco, Jézéquel, Fabienne, Kohl, Nils, Li, Xiaoye Sherry, Lion, Romain, Mehl, Miriam, Mycek, Paul, Obersteiner, Michael, Quintana-Ortí, Enrique S., Rizzi, Francesco, Rüde, Ulrich, Schulz, Martin, Fung, Fred, Speck, Robert, Stals, Linda, Teranishi, Keita, Thibault, Samuel, Thönnes, Dominik, Wagner, Andreas, and Wohlmuth, Barbara. 2021. "Resiliency in numerical algorithm design for extreme scale simulations". United States. https://doi.org/10.1177/10943420211055188. https://www.osti.gov/servlets/purl/1855669.


                    
@article{osti_1855669,

  title        = {Resiliency in numerical algorithm design for extreme scale simulations},

  author       = {Agullo, Emmanuel and Altenbernd, Mirco and Anzt, Hartwig and Bautista-Gomez, Leonardo and Benacchio, Tommaso and Bonaventura, Luca and Bungartz, Hans-Joachim and Chatterjee, Sanjay and Ciorba, Florina M. and DeBardeleben, Nathan and Drzisga, Daniel and Eibl, Sebastian and Engelmann, Christian and Gansterer, Wilfried N. and Giraud, Luc and Göddeke, Dominik and Heisig, Marco and Jézéquel, Fabienne and Kohl, Nils and Li, Xiaoye Sherry and Lion, Romain and Mehl, Miriam and Mycek, Paul and Obersteiner, Michael and Quintana-Ortí, Enrique S. and Rizzi, Francesco and Rüde, Ulrich and Schulz, Martin and Fung, Fred and Speck, Robert and Stals, Linda and Teranishi, Keita and Thibault, Samuel and Thönnes, Dominik and Wagner, Andreas and Wohlmuth, Barbara},

  abstractNote = {This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing $10^{23}$ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.},

  doi          = {10.1177/10943420211055188},

  journal      = {International Journal of High Performance Computing Applications},

  number       = 2,

  volume       = 36,

  place        = {United States},

  year         = {2021},

  month        = {12}

}


Works referenced in this record:

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets
conference, May 2019

Keller, Kai; Bautista-Gomez, Leonardo
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
DOI: 10.1109/CCGRID.2019.00015

Scalable, fault tolerant membership for MPI tasks on HPC systems
conference, January 2006

Varma, Jyothish; Wang, Chao; Mueller, Frank
Proceedings of the 20th annual international conference on Supercomputing - ICS '06
DOI: 10.1145/1183401.1183433

Toward fault-tolerant parallel-in-time integration with PFASST
journal, February 2017

Speck, Robert; Ruprecht, Daniel
Parallel Computing, Vol. 62
DOI: 10.1016/j.parco.2016.12.001

Correcting soft errors online in fast fourier transform
conference, January 2017

Liang, Xin; Chen, Zizhong; Chen, Jieyang
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
DOI: 10.1145/3126908.3126915

A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations
conference, November 2017

Obersteiner, Michael; Hinojosa, Alfredo Parra; Heene, Mario
SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
DOI: 10.1145/3148226.3148229

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
conference, March 2007

Wang, Chao; Mueller, Frank; Engelmann, Christian
2007 IEEE International Parallel and Distributed Processing Symposium
DOI: 10.1109/IPDPS.2007.370307

The Open Community Runtime: A runtime system for extreme scale computing
conference, September 2016

Mattson, Timothy G.; Cledat, Romain; Cave, Vincent
2016 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC.2016.7761580

A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique
conference, July 2015

Ali, Md Mohsin; Strazdins, Peter E.; Harding, Brendan
2015 International Conference on High Performance Computing & Simulation (HPCS)
DOI: 10.1109/HPCSim.2015.7237082

ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability
journal, January 2012

George, Cijo; Vadhiyar, Sathish S.
Procedia Computer Science, Vol. 9
DOI: 10.1016/j.procs.2012.04.018

Algorithm-based fault tolerance for dense matrix factorizations
journal, September 2012

Du, Peng; Bouteiller, Aurelien; Bosilca, George
ACM SIGPLAN Notices, Vol. 47, Issue 8
DOI: 10.1145/2370036.2145845

An evaluation of lazy fault detection based on Adaptive Redundant Multithreading
conference, September 2014

Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.
2014 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC.2014.7040999

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
conference, November 2015

Jaulmes, Luc; Casas, Marc; Moretó, Miquel
SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/2807591.2807599

Investigating the Resilience of Dynamic Loop Scheduling in Heterogeneous Computing Systems
conference, June 2015

Sukhija, Nitin; Banicescu, Ioana; Ciorba, Florina M.
2015 14th International Symposium on Parallel and Distributed Computing (ISPDC)
DOI: 10.1109/ISPDC.2015.29

CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance
journal, March 2019

Shahzad, Faisal; Thies, Jonas; Kreutzer, Moritz
IEEE Transactions on Parallel and Distributed Systems, Vol. 30, Issue 3
DOI: 10.1109/TPDS.2018.2866794

MCALIB: Measuring Sensitivity to Rounding Error with Monte Carlo Programming
journal, April 2015

Frechtling, Michael; Leong, Philip H. W.
ACM Transactions on Programming Languages and Systems, Vol. 37, Issue 2
DOI: 10.1145/2665073

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations
conference, May 2019

Cavelan, Aurelien; Cabezon, Ruben M.; Ciorba, Florina M.
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
DOI: 10.1109/CCGRID.2019.00013

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

Bosilca, George; Delmas, Rémi; Dongarra, Jack
Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
DOI: 10.1016/j.jpdc.2008.12.002

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
2010 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2010.18

Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

Laguna, Ignacio; Richards, David F.; Gamblin, Todd
The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
DOI: 10.1177/1094342015623623

A multirate time stepping strategy for stiff ordinary differential equations
journal, November 2006

Savcenco, V.; Hundsdorfer, W.; Verwer, J. G.
BIT Numerical Mathematics, Vol. 47, Issue 1
DOI: 10.1007/s10543-006-0095-7

A dimension adaptive sparse grid combination technique for machine learning
journal, April 2007

Garcke, Jochen
ANZIAM Journal, Vol. 48
DOI: 10.21914/anziamj.v48i0.70

A fault tolerant approach to microprocessor design
conference, January 2001

Weaver, C.; Austin, T.
Proceedings International Conference on Dependable Systems and Networks
DOI: 10.1109/DSN.2001.941425

Characterizing the impact of soft errors on iterative methods in scientific computing
conference, January 2011

Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
Proceedings of the international conference on Supercomputing - ICS '11
DOI: 10.1145/1995896.1995922

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
journal, February 2021

Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco
The International Journal of High Performance Computing Applications, Vol. 35, Issue 4
DOI: 10.1177/1094342021990433

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006

Hargrove, Paul H.; Duell, Jason C.
Journal of Physics: Conference Series, Vol. 46
DOI: 10.1088/1742-6596/46/1/067

An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL
journal, April 2018

Tang, Xiongchao; Zhai, Jidong; Yu, Bowen
IEEE Transactions on Parallel and Distributed Systems, Vol. 29, Issue 4
DOI: 10.1109/TPDS.2017.2781257

Algorithm-Based Fault Tolerance for Parallel Stencil Computations
conference, September 2019

Cavelan, Aurelien; Ciorba, Florina M.
2019 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/CLUSTER.2019.8891034

Methods of conjugate gradients for solving linear systems
journal, December 1952

Hestenes, M. R.; Stiefel, E.
Journal of Research of the National Bureau of Standards, Vol. 49, Issue 6
DOI: 10.6028/jres.049.044

Comparison between adaptive and uniform discontinuous Galerkin simulations in dry 2D bubble experiments
journal, February 2013

Müller, Andreas; Behrens, Jörn; Giraldo, Francis X.
Journal of Computational Physics, Vol. 235
DOI: 10.1016/j.jcp.2012.10.038

Fully Adaptive Multigrid Methods
journal, February 1993

Rüde, Ulrich
SIAM Journal on Numerical Analysis, Vol. 30, Issue 1
DOI: 10.1137/0730011

Tuning stationary iterative solvers for fault resilience
conference, January 2015

Anzt, Hartwig; Dongarra, Jack; Quintana-Ortí, Enrique S.
Proceedings of the 6th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '15
DOI: 10.1145/2832080.2832081

A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes
journal, December 2017

Bauer, S.; Mohr, M.; Rüde, U.
Applied Numerical Mathematics, Vol. 122
DOI: 10.1016/j.apnum.2017.07.006

A PIN-Based Dynamic Software Fault Injection System
conference, November 2008

Jin, Ang; Jiang, Jianhui; Hu, Jiawei
2008 9th International Conference for Young Computer Scientists (ICYCS), 2008 The 9th International Conference for Young Computer Scientists
DOI: 10.1109/ICYCS.2008.329

Extreme-Scale Block-Structured Adaptive Mesh Refinement
journal, January 2018

Schornbaum, Florian; Rüde, Ulrich
SIAM Journal on Scientific Computing, Vol. 40, Issue 3
DOI: 10.1137/17M1128411

A Stencil Scaling Approach for Accelerating Matrix-Free Finite Element Implementations
journal, January 2018

Bauer, S.; Drzisga, D.; Mohr, M.
SIAM Journal on Scientific Computing, Vol. 40, Issue 6
DOI: 10.1137/17M1148384

Discrete Stochastic Arithmetic for Validating Results of Numerical Software
journal, December 2004

Vignes, Jean
Numerical Algorithms, Vol. 37, Issue 1-4
DOI: 10.1023/B:NUMA.0000049483.75679.ce

rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks
conference, July 2019

Mohammed, Ali; Cavelan, Aurelien; Ciorba, Florina M.
2019 International Conference on High Performance Computing & Simulation (HPCS)
DOI: 10.1109/HPCS48598.2019.9188153

An efficient parallel implementation of explicit multirate Runge–Kutta schemes for discontinuous Galerkin computations
journal, January 2014

Seny, Bruno; Lambrechts, Jonathan; Toulorge, Thomas
Journal of Computational Physics, Vol. 256
DOI: 10.1016/j.jcp.2013.07.041

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2014.78

Achieving algorithmic resilience for temporal integration through spectral deferred corrections
journal, January 2017

Grout, Ray; Kolla, Hemanth; Minion, Michael
Communications in Applied Mathematics and Computational Science, Vol. 12, Issue 1
DOI: 10.2140/camcos.2017.12.25

Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices
conference, June 2015

Austin, Brian; Roman, Eric; Li, Xiaoye
HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
DOI: 10.1145/2751504.2751507

PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures
conference, January 2017

Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
DOI: 10.1145/3126908.3126943

Algorithm-based fault recovery of adaptively refined parallel multilevel grids
journal, August 2017

Stals, Linda
The International Journal of High Performance Computing Applications, Vol. 33, Issue 1
DOI: 10.1177/1094342017720801

FTI: high performance fault tolerance interface for hybrid systems
conference, January 2011

Bautista-Gomez, Leonardo; Tsuboi, Seiji; Komatitsch, Dimitri
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
DOI: 10.1145/2063384.2063427

On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing
journal, October 2015

Casanova, Henri; Robert, Yves; Vivien, Frédéric
Future Generation Computer Systems, Vol. 51
DOI: 10.1016/j.future.2015.04.003

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing
conference, March 2018

Ashraf, Rizwan A.; Hukerikar, Saurabh; Engelmann, Christian
ICPE '18: ACM/SPEC International Conference on Performance Engineering, Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering
DOI: 10.1145/3184407.3184421

Multivariate Quadrature on Adaptive Sparse Grids
journal, August 2003

Bungartz, H. -J.; Dirnstorfer, S.
Computing, Vol. 71, Issue 1
DOI: 10.1007/s00607-003-0016-4

Algorithms and data structures for massively parallel generic adaptive finite element codes
journal, December 2011

Bangerth, Wolfgang; Burstedde, Carsten; Heister, Timo
ACM Transactions on Mathematical Software, Vol. 38, Issue 2
DOI: 10.1145/2049673.2049678

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures
conference, August 2019

Pachajoa, Carlos; Levonyak, Markus; Gansterer, Wilfried N.
ICPP 2019: 48th International Conference on Parallel Processing, Proceedings of the 48th International Conference on Parallel Processing
DOI: 10.1145/3337821.3337849

A self adjusting multirate algorithm for robust time discretization of partial differential equations
journal, April 2020

Bonaventura, L.; Casella, F.; Carciopolo, L. Delpopolo
Computers & Mathematics with Applications, Vol. 79, Issue 7
DOI: 10.1016/j.camwa.2019.11.023

Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors
conference, September 2007

Scholzel, Mario
2007 Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA 2007), Signal Processing Algorithms, Architectures, Arrangements, and Applications SPA 2007
DOI: 10.1109/SPA.2007.5903294

Fault Tolerance in the Parareal Method
conference, May 2016

Nielsen, Allan S.; Hesthaven, Jan S.
HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
DOI: 10.1145/2909428.2909431

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale
conference, May 2019

Nicolae, Bogdan; Moody, Adam; Gonsiorowski, Elsa
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2019.00099

Toward Exascale Resilience
journal, September 2009

Cappello, Franck; Geist, Al; Gropp, Bill
The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
DOI: 10.1177/1094342009347767

Multirate linear multistep methods
journal, December 1984

Gear, C. W.; Wells, D. R.
BIT, Vol. 24, Issue 4
DOI: 10.1007/BF01934907

Complex scientific applications made fault-tolerant with the sparse grid combination technique
journal, July 2016

Ali, Md Mohsin; Strazdins, Peter E.; Harding, Brendan
The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
DOI: 10.1177/1094342015628056

FlipBack: Automatic Targeted Protection against Silent Data Corruption
conference, November 2016

Ni, Xiang; Kale, Laxmikant V.
SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2016.28

Robust distributed orthogonalization based on randomized aggregation
conference, January 2011

Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana
Proceedings of the second workshop on Scalable algorithms for large-scale systems - ScalA '11
DOI: 10.1145/2133173.2133177

Resilience for Massively Parallel Multigrid Solvers
journal, January 2016

Huber, Markus; Gmeiner, Björn; Rüde, Ulrich
SIAM Journal on Scientific Computing, Vol. 38, Issue 5
DOI: 10.1137/15M1026122

Parallel adaptive FETI‐DP using lightweight asynchronous dynamic load balancing
journal, October 2019

Klawonn, Axel; Kühn, Martin J.; Rheinbach, Oliver
International Journal for Numerical Methods in Engineering, Vol. 121, Issue 4
DOI: 10.1002/nme.6237

Fault tolerant communication-optimal 2.5D matrix multiplication
journal, June 2017

Moldaschl, Michael; Prikopa, Karl E.; Gansterer, Wilfried N.
Journal of Parallel and Distributed Computing, Vol. 104
DOI: 10.1016/j.jpdc.2017.01.022

On asynchronous iterations
journal, November 2000

Frommer, Andreas; Szyld, Daniel B.
Journal of Computational and Applied Mathematics, Vol. 123, Issue 1-2
DOI: 10.1016/S0377-0427(00)00409-X

Programming Models and Development Software for a Space-Based Many-Core Processor
conference, August 2011

Crago, Stephen P.; Kang, Dong-In; Kang, Mikyung
2011 IEEE International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2011 IEEE Fourth International Conference on Space Mission Challenges for Information Technology
DOI: 10.1109/SMC-IT.2011.29

Soft fault detection and correction for multigrid
journal, February 2017

Altenbernd, Mirco; Göddeke, Dominik
The International Journal of High Performance Computing Applications, Vol. 32, Issue 6
DOI: 10.1177/1094342016684006

Evaluating Support for OpenMP Offload Features
conference, January 2018

Diaz, Jose Monsalve; Pophale, Swaroop; Friedline, Kyle
Proceedings of the 47th International Conference on Parallel Processing Companion - ICPP '18
DOI: 10.1145/3229710.3229717

Anisotropic mesh adaptivity for multi-scale ocean modelling
journal, November 2009

Piggott, M. D.; Farrell, P. E.; Wilson, C. R.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, Vol. 367, Issue 1907
DOI: 10.1098/rsta.2009.0155

Fine-Grained Parallel Incomplete LU Factorization
journal, January 2015

Chow, Edmond; Patel, Aftab
SIAM Journal on Scientific Computing, Vol. 37, Issue 2
DOI: 10.1137/140968896

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
journal, January 2008

Langou, J.; Chen, Z.; Bosilca, G.
SIAM Journal on Scientific Computing, Vol. 30, Issue 1
DOI: 10.1137/040620394

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
conference, August 2016

Gamell, Marc; Katz, Daniel S.; Teranishi, Keita
2016 45th International Conference on Parallel Processing Workshops (ICPPW)
DOI: 10.1109/ICPPW.2016.56

ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner
conference, May 2016

Rizzi, Francesco; Morris, Karla; Sargsyan, Khachik
HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
DOI: 10.1145/2909428.2909429

Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions
journal, April 2018

Casas, Marc; Gansterer, Wilfried N.; Wimmer, Elias
The International Journal of High Performance Computing Applications, Vol. 33, Issue 2
DOI: 10.1177/1094342018762531

Unified fault-tolerance framework for hybrid task-parallel message-passing applications
journal, September 2016

Subasi, Omer; Martsinkevich, Tatiana; Zyulkyarov, Ferad
The International Journal of High Performance Computing Applications, Vol. 32, Issue 5
DOI: 10.1177/1094342016669416

A method of finite element tearing and interconnecting and its parallel solution algorithm
journal, October 1991

Farhat, Charbel; Roux, Francois-Xavier
International Journal for Numerical Methods in Engineering, Vol. 32, Issue 6
DOI: 10.1002/nme.1620320604

REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed
conference, November 2017

Georgakoudis, Giorgis; Laguna, Ignacio; Nikolopoulos, Dimitrios S.
SC '17: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/3126908.3126972

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
conference, May 2014

Di, Sheng; Bouguerra, Mohamed Slim; Bautista-Gomez, Leonardo
2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
DOI: 10.1109/IPDPS.2014.122

Discrete A Priori Bounds for the Detection of Corrupted PDE Solutions in Exascale Computations
journal, January 2017

Mycek, Paul; Rizzi, Francesco; Maître, Olivier Le
SIAM Journal on Scientific Computing, Vol. 39, Issue 1
DOI: 10.1137/15M1051786

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods
conference, November 2018

Pachajoa, Carlos; Levonyak, Markus; Gansterer, Wilfried N.
2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
DOI: 10.1109/FTXS.2018.00009

Numerical recovery strategies for parallel resilient Krylov linear solvers
journal, August 2016

Agullo, Emmanuel; Giraud, Luc; Guermouche, Abdou
Numerical Linear Algebra with Applications, Vol. 23, Issue 5
DOI: 10.1002/nla.2059

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
conference, October 2013

Engelmann, Christian; Naughton, Thomas
2013 42nd International Conference on Parallel Processing (ICPP)
DOI: 10.1109/ICPP.2013.114

Debugging and Optimization of HPC Programs with the Verrou Tool
conference, November 2019

Fevotte, Francois; Lathuiliere, Bruno
2019 IEEE/ACM 3rd International Workshop on Software Correctness for HPC Applications (Correctness)
DOI: 10.1109/Correctness49594.2019.00006

CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications
journal, November 2009

Rodríguez, Gabriel; Martín, María J.; González, Patricia
Concurrency and Computation: Practice and Experience, Vol. 22, Issue 6
DOI: 10.1002/cpe.1541

Local rollback for resilient MPI applications with application-level checkpointing and message logging
journal, February 2019

Losada, Nuria; Bosilca, George; Bouteiller, Aurélien
Future Generation Computer Systems, Vol. 91
DOI: 10.1016/j.future.2018.09.041

A SIMD-based software fault tolerance for ARM processors
conference, May 2017

Lin, Shun-Zhi; Chen, Peng-Sheng
2017 International Conference on Applied System Innovation (ICASI)
DOI: 10.1109/ICASI.2017.7988587

Improving Application Resilience by Extending Error Correction with Contextual Information
conference, November 2018

Poulos, Alexandra; Wallace, Dylan; Robey, Robert
2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
DOI: 10.1109/FTXS.2018.00006

DIVA: a reliable substrate for deep submicron microarchitecture design
conference, January 1999

Austin, T. M.
MICRO-32. 32nd Annual ACM/IEEE International Symposium on Microarchitecture, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture
DOI: 10.1109/MICRO.1999.809458

Improving performance of iterative methods by lossy checkponting
conference, January 2018

Tao, Dingwen; Di, Sheng; Liang, Xin
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
DOI: 10.1145/3208040.3208050

RAJA: Portable Performance for Large-Scale Scientific Applications
conference, November 2019

Beckingsale, David A.; Scogland, Thomas RW; Burmark, Jason
2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)
DOI: 10.1109/P3HPC49587.2019.00012

Fault-tolerant least squares solvers for wireless sensor networks based on gossiping
journal, February 2020

Prikopa, Karl E.; Gansterer, Wilfried N.
Journal of Parallel and Distributed Computing, Vol. 136
DOI: 10.1016/j.jpdc.2019.09.006

Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
journal, October 2016

Di, Sheng; Cappello, Franck
IEEE Transactions on Parallel and Distributed Systems, Vol. 27, Issue 10
DOI: 10.1109/TPDS.2016.2517639

Chaotic relaxation
journal, April 1969

Chazan, D.; Miranker, W.
Linear Algebra and its Applications, Vol. 2, Issue 2
DOI: 10.1016/0024-3795(69)90028-7

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery
conference, November 2015

Peña, Antonio J.; Bland, Wesley; Balaji, Pavan
SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/2807591.2807640

Toward General Software Level Silent Data Corruption Detection for Parallel Applications
journal, December 2017

Berrocal, Eduardo; Bautista-Gomez, Leonardo; Di, Sheng
IEEE Transactions on Parallel and Distributed Systems, Vol. 28, Issue 12
DOI: 10.1109/TPDS.2017.2735971

Stochastic subspace correction methods and fault tolerance
journal, August 2019

Griebel, Michael; Oswald, Peter
Mathematics of Computation, Vol. 89, Issue 321
DOI: 10.1090/mcom/3459

Tsunami modelling with adaptively refined finite volume methods
journal, April 2011

LeVeque, Randall J.; George, David L.; Berger, Marsha J.
Acta Numerica, Vol. 20
DOI: 10.1017/S0962492911000043

Proactive fault tolerance for HPC with Xen virtualization
conference, January 2007

Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian
Proceedings of the 21st annual international conference on Supercomputing - ICS '07
DOI: 10.1145/1274971.1274978

Combining Partial Redundancy and Checkpointing for HPC
conference, June 2012

Elliott, James; Kharbas, Kishor; Fiala, David
2012 IEEE 32nd International Conference on Distributed Computing Systems (ICDCS)
DOI: 10.1109/ICDCS.2012.56

Adaptive control in roll-forward recovery for extreme scale multigrid
journal, December 2018

Huber, Markus; Rüde, Ulrich; Wohlmuth, Barbara
The International Journal of High Performance Computing Applications, Vol. 33, Issue 5
DOI: 10.1177/1094342018817088

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
journal, December 2014

Carter Edwards, H.; Trott, Christian R.; Sunderland, Daniel
Journal of Parallel and Distributed Computing, Vol. 74, Issue 12
DOI: 10.1016/j.jpdc.2014.07.003

SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation
conference, April 2017

Hari, Siva Kumar Sastry; Tsai, Timothy; Stephenson, Mark
2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
DOI: 10.1109/ISPASS.2017.7975296

Hybrid Checkpointing for MPI Jobs in HPC Environments
conference, December 2010

Wang, Chao; Mueller, Frank; Engelmann, Christian
2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS)
DOI: 10.1109/ICPADS.2010.48

Exploring versioned distributed arrays for resilience in scientific applications: global view resilience
journal, September 2016

Chien, A.; Balaji, P.; Dun, N.
The International Journal of High Performance Computing Applications, Vol. 31, Issue 6
DOI: 10.1177/1094342016664796

Fine-grained bit-flip protection for relaxation methods
journal, September 2019

Anzt, Hartwig; Dongarra, Jack; Quintana-Ortí, Enrique S.
Journal of Computational Science, Vol. 36
DOI: 10.1016/j.jocs.2016.11.013

A scalable and extensible checkpointing scheme for massively parallel simulations
journal, May 2018

Kohl, Nils; Hötzer, Johannes; Schornbaum, Florian
The International Journal of High Performance Computing Applications, Vol. 33, Issue 4
DOI: 10.1177/1094342018767736

Asynchronous Iterative Methods for Multiprocessors
journal, April 1978

Baudet, Gérard M.
Journal of the ACM, Vol. 25, Issue 2
DOI: 10.1145/322063.322067

Error detection by duplicated instructions in super-scalar processors
journal, March 2002

Oh, N.; Shirvani, P. P.; McCluskey, E. J.
IEEE Transactions on Reliability, Vol. 51, Issue 1
DOI: 10.1109/24.994913

Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization
conference, May 2017

Tao, Dingwen; Di, Sheng; Chen, Zizhong
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2017.115

Fault Tolerance Properties of Gossip-Based Distributed Orthogonal Iteration Methods
journal, January 2013

Straková, Hana; Niederbrucker, Gerhard; Gansterer, Wilfried N.
Procedia Computer Science, Vol. 18
DOI: 10.1016/j.procs.2013.05.182

HPX: A Task Based Programming Model in a Global Address Space
conference, January 2014

Kaiser, Hartmut; Heller, Thomas; Adelstein-Lelbach, Bryce
Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14
DOI: 10.1145/2676870.2676883

Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation
journal, November 2013

Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana
Journal of Computational Science, Vol. 4, Issue 6
DOI: 10.1016/j.jocs.2013.01.006

Supporting highly-decoupled thread-level redundancy for parallel programs
conference, February 2008

Rashid, M. Wasiur; Huang, Michael C.
2008 IEEE 14th International Symposium on High Performance Computer Architecture (HPCA)
DOI: 10.1109/HPCA.2008.4658655

Parallel reduction to Hessenberg form with algorithm-based fault tolerance
conference, November 2013

Jia, Yulu; Bosilca, George; Luszczek, Piotr
SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/2503210.2503249

Asynchronous Iterative Algorithms with Flexible Communication for Nonlinear Network Flow Problems
journal, October 1996

El Baz, Didier; Spiteri, Pierre; Miellou, Jean Claude
Journal of Parallel and Distributed Computing, Vol. 38, Issue 1
DOI: 10.1006/jpdc.1996.0124

The GeoClaw software for depth-averaged flows with adaptive refinement
journal, September 2011

Berger, Marsha J.; George, David L.; LeVeque, Randall J.
Advances in Water Resources, Vol. 34, Issue 9
DOI: 10.1016/j.advwatres.2011.02.016

ROSE::FTTransform - A source-to-source translation framework for exascale fault-tolerance research
conference, June 2012

Lidman, Jacob; Quinlan, Daniel J.; Liao, Chunhua
2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
DOI: 10.1109/DSNW.2012.6264672

A scalable double in-memory checkpoint and restart scheme towards exascale
conference, June 2012

Zheng, Gengbin; Kale, Laxmikant V.
2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
DOI: 10.1109/DSNW.2012.6264677

Regression with the optimised combination technique
conference, January 2006

Garcke, Jochen
Proceedings of the 23rd international conference on Machine learning - ICML '06
DOI: 10.1145/1143844.1143885

Error-Controlled Lossy Compression Optimized for High Compression Ratios of Scientific Datasets
conference, December 2018

Liang, Xin; Di, Sheng; Tao, Dingwen
2018 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/BigData.2018.8622520

Identifying the Right Replication Level to Detect and Correct Silent Errors at Scale
conference, January 2017

Benoit, Anne; Cavelan, Aurélien; Cappello, Franck
Proceedings of the 2017 Workshop on Fault-Tolerance for HPC at Extreme Scale - FTXS '17
DOI: 10.1145/3086157.3086162

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
conference, November 2006

Coti, Camille; Herault, Thomas; Lemarinier, Pierre
SC 2006 Proceedings Supercomputing 2006, ACM/IEEE SC 2006 Conference (SC'06)
DOI: 10.1109/SC.2006.15

Proactive process-level live migration and back migration in HPC environments
journal, February 2012

Wang, Chao; Mueller, Frank; Engelmann, Christian
Journal of Parallel and Distributed Computing, Vol. 72, Issue 2, p. 254-267
DOI: 10.1016/j.jpdc.2011.10.009

Density Estimation with Adaptive Sparse Grids for Large Data Sets
conference, April 2014

Peherstorfer, Benjamin; Pflüger, Dirk; Bungartz, Hans-Joachim
Proceedings of the 2014 SIAM International Conference on Data Mining
DOI: 10.1137/1.9781611973440.51

Programmer-directed partial redundancy for resilient HPC
conference, May 2015

Subasi, Omer; Arias, Javier; Unsal, Osman
CF'15: Computing Frontiers Conference, Proceedings of the 12th ACM International Conference on Computing Frontiers
DOI: 10.1145/2742854.2742903

On the Resilience of Parallel Sparse Hybrid Solvers
conference, December 2015

Agullo, Emmanuel; Giraud, Luc; Zounon, Mawussi
2015 IEEE 22nd International Conference on High Performance Computing (HiPC)
DOI: 10.1109/HiPC.2015.9

Does partial replication pay off?
conference, June 2012

Stearley, Jon; Ferreira, Kurt; Robinson, David
2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012)
DOI: 10.1109/DSNW.2012.6264669

Proactive Fault Tolerance Using Preemptive Migration
conference, February 2009

Engelmann, Christian; Vallee, Geoffroy R.; Naughton, Thomas
2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing
DOI: 10.1109/PDP.2009.31

Design and modeling of a non-blocking checkpointing system
conference, November 2012

Sato, Kento; Maruyama, Naoya; Mohror, Kathryn
2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2012.46

FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK Cholesky, QR, and LU factorization routines
conference, January 2014

Wu, Panruo; Chen, Zizhong
Proceedings of the 23rd international symposium on High-performance parallel and distributed computing - HPDC '14
DOI: 10.1145/2600212.2600232

Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
journal, August 2013

Chen, Zizhong
ACM SIGPLAN Notices, Vol. 48, Issue 8
DOI: 10.1145/2517327.2442533

Detection and correction of silent data corruption for large-scale high-performance computing
conference, November 2012

Fiala, David; Mueller, Frank; Engelmann, Christian
2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
DOI: 10.1109/SC.2012.49

Dynamic Malleability in Iterative MPI Applications
conference, May 2007

El Maghraoui, Kaoutar; Desell, Travis J.; Szymanski, Boleslaw K.
Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)
DOI: 10.1109/CCGRID.2007.45

A Pattern Language for High-Performance Computing Resilience
conference, July 2017

Hukerikar, Saurabh; Engelmann, Christian
EuroPLoP '17: European Conference on Pattern Languages of Programs, Proceedings of the 22nd European Conference on Pattern Languages of Programs
DOI: 10.1145/3147704.3147718

Asynchronous optimized Schwarz methods with and without overlap
journal, March 2017

Magoulès, Frédéric; Szyld, Daniel B.; Venet, Cédric
Numerische Mathematik, Vol. 137, Issue 1
DOI: 10.1007/s00211-017-0872-z

MACORD: Online Adaptive Machine Learning Framework for Silent Error Detection
conference, September 2017

Subasi, Omer; Di, Sheng; Balaprakash, Prasanna
2017 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/CLUSTER.2017.128

F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability
conference, May 2014

Guan, Qiang; DeBardeleben, Nathan; Blanchard, Sean
2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
DOI: 10.1109/IPDPS.2014.128

An ABFT Scheme Based on Communication Characteristics
conference, September 2016

Kabir, Upama; Goswami, Dhrubajyoti
2016 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/CLUSTER.2016.68

SPBC: leveraging the characteristics of MPI HPC applications for scalable checkpointing
conference, November 2013

Ropars, Thomas; Martsinkevich, Tatiana V.; Guermouche, Amina
SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
DOI: 10.1145/2503210.2503271

Towards resilient EU HPC systems: a blueprint
conference, April 2019

Radojkovic, Petar
CF '19: Computing Frontiers Conference, Proceedings of the 16th ACM International Conference on Computing Frontiers
DOI: 10.1145/3310273.3323434

A conservative implicit multirate method for hyperbolic problems
journal, August 2018

Delpopolo Carciopolo, Ludovica; Bonaventura, Luca; Scotti, Anna
Computational Geosciences, Vol. 23, Issue 4
DOI: 10.1007/s10596-018-9764-2

DisCVar: discovering critical variables using algorithmic differentiation for transient faults
journal, March 2018

Menon, Harshitha; Mohror, Kathryn
ACM SIGPLAN Notices, Vol. 53, Issue 1
DOI: 10.1145/3200691.3178502

Performance Scaling Variability and Energy Analysis for a Resilient ULFM-based PDE Solver
conference, November 2016

Morris, K.; Rizzi, F.; Cook, B.
2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)
DOI: 10.1109/ScalA.2016.010

NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart
conference, March 2015

Subasi, Omer; Arias, Javier; Unsal, Osman
2015 23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
DOI: 10.1109/PDP.2015.17

Is the Multigrid Method Fault Tolerant? The Multilevel Case
journal, January 2017

Ainsworth, Mark; Glusa, Christian
SIAM Journal on Scientific Computing, Vol. 39, Issue 6
DOI: 10.1137/16M1097274

A survey of rollback-recovery protocols in message-passing systems
journal, September 2002

Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min
ACM Computing Surveys, Vol. 34, Issue 3
DOI: 10.1145/568522.568525

Asynchronous Multigrid Methods
conference, May 2019

Wolfson-Pou, Jordi; Chow, Edmond
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2019.00021

A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets
conference, September 2016

Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad
2016 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/CLUSTER.2016.54

Basic concepts and taxonomy of dependable and secure computing
journal, January 2004

Avizienis, A.; Laprie, J. -C.; Randell, B.
IEEE Transactions on Dependable and Secure Computing, Vol. 1, Issue 1
DOI: 10.1109/TDSC.2004.2

An algorithm-based error detection scheme for the multigrid method
journal, September 2003

Mishra, A.; Banerjee, P.
IEEE Transactions on Computers, Vol. 52, Issue 9
DOI: 10.1109/TC.2003.1228507

Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression
conference, September 2020

Li, Sihuan; Di, Sheng; Zhao, Kai
2020 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/CLUSTER49012.2020.00043

Silent error detection in numerical time-stepping schemes
journal, April 2014

Benson, Austin R.; Schmit, Sven; Schreiber, Robert
The International Journal of High Performance Computing Applications, Vol. 29, Issue 4
DOI: 10.1177/1094342014532297

Transient-fault recovery using simultaneous multithreading
conference, January 2002

Vijaykumar, T. N.; Pomeranz, I.; Cheng, K.
Proceedings 29th Annual International Symposium on Computer Architecture
DOI: 10.1109/ISCA.2002.1003565

Self-stabilizing systems in spite of distributed control
journal, November 1974

Dijkstra, Edsger W.
Communications of the ACM, Vol. 17, Issue 11
DOI: 10.1145/361179.361202

From tasks graphs to asynchronous distributed checkpointing with local restart
conference, November 2020

Lion, Romain; Thibault, Samuel
2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
DOI: 10.1109/FTXS51974.2020.00009

On the Analysis of Block Smoothers for Saddle Point Problems
journal, January 2018

Drzisga, Daniel; John, Lorenz; Rüde, Ulrich
SIAM Journal on Matrix Analysis and Applications, Vol. 39, Issue 2
DOI: 10.1137/16M1106304

Fault Tolerant Computation with the Sparse Grid Combination Technique
journal, January 2015

Harding, Brendan; Hegland, Markus; Larson, Jay
SIAM Journal on Scientific Computing, Vol. 37, Issue 3
DOI: 10.1137/140964448

Self-stabilizing iterative solvers
conference, January 2013

Sao, Piyush; Vuduc, Richard
Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - ScalA '13
DOI: 10.1145/2530268.2530272

Mini-Ckpts: Surviving OS Failures in Persistent Memory
conference, June 2016

Fiala, David; Mueller, Frank; Ferreira, Kurt
ICS '16: 2016 International Conference on Supercomputing, Proceedings of the 2016 International Conference on Supercomputing
DOI: 10.1145/2925426.2926295

Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids
journal, January 2016

Schornbaum, Florian; Rüde, Ulrich
SIAM Journal on Scientific Computing, Vol. 38, Issue 2
DOI: 10.1137/15M1035240

Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner
journal, April 2018

Rizzi, F.; Morris, K.; Sargsyan, K.
Parallel Computing, Vol. 73
DOI: 10.1016/j.parco.2017.05.005

Algorithm-Based Fault Tolerance for Convolutional Neural Networks
journal, January 2021

Zhao, Kai; Di, Sheng; Li, Sihuan
IEEE Transactions on Parallel and Distributed Systems
DOI: 10.1109/tpds.2020.3043449

A Posteriori Error Estimates Based on Hierarchical Bases
journal, August 1993

Bank, Randolph E.; Smith, R. Kent
SIAM Journal on Numerical Analysis, Vol. 30, Issue 4
DOI: 10.1137/0730048

Performance of asynchronous optimized Schwarz with one-sided communication
journal, August 2019

Yamazaki, Ichitaro; Chow, Edmond; Bouteiller, Aurelien
Parallel Computing, Vol. 86
DOI: 10.1016/j.parco.2019.05.004

CHARM++: a portable concurrent object oriented system based on C++
conference, January 1993

Kale, Laxmikant V.; Krishnan, Sanjeev
Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications - OOPSLA '93
DOI: 10.1145/165854.165874

Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
conference, January 2014

Teranishi, Keita; Heroux, Michael A.
Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14
DOI: 10.1145/2642769.2642774

A Class of Multirate Infinitesimal GARK Methods
journal, January 2019

Sandu, Adrian
SIAM Journal on Numerical Analysis, Vol. 57, Issue 5
DOI: 10.1137/18M1205492

Fast Error-Bounded Lossy HPC Data Compression with SZ
conference, May 2016

Di, Sheng; Cappello, Franck
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2016.11

Residual Replacement Strategies for Krylov Subspace Iterative Methods for the Convergence of True Residuals
journal, January 2000

van der Vorst, Henk A.; Ye, Qiang
SIAM Journal on Scientific Computing, Vol. 22, Issue 3
DOI: 10.1137/S1064827599353865

Dimension-Adaptive Tensor-Product Quadrature
journal, August 2003

Gerstner, T.; Griebel, M.
Computing, Vol. 71, Issue 1
DOI: 10.1007/s00607-003-0015-5

FlipSphere: A Software-Based DRAM Error Detection and Correction Library for HPC
conference, September 2016

Fiala, David; Mueller, Frank; Ferreira, Kurt B.
2016 IEEE/ACM 20th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)
DOI: 10.1109/DS-RT.2016.27

A semi-implicit, semi-Lagrangian discontinuous Galerkin framework for adaptive numerical weather prediction
journal, May 2015

Tumolo, Giovanni; Bonaventura, Luca
Quarterly Journal of the Royal Meteorological Society, Vol. 141, Issue 692
DOI: 10.1002/qj.2544

Recent Advances and New Avenues in Hardware-Level Reliability Support
journal, November 2005

Iyer, R. K.; Nakka, N. M.; Kalbarczyk, Z. T.
IEEE Micro, Vol. 25, Issue 6
DOI: 10.1109/MM.2005.119

Large-scale simulation of mantle convection based on a new matrix-free approach
journal, February 2019

Bauer, S.; Huber, M.; Ghelichkhan, S.
Journal of Computational Science, Vol. 31
DOI: 10.1016/j.jocs.2018.12.006

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers
conference, January 2005

Gioiosa, R.; Sancho, J. C.; Jiang, S.
ACM/IEEE SC 2005 Conference (SC'05)
DOI: 10.1109/SC.2005.76

Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984

Huang, Kuang-Hua; Abraham, Jacob A.
IEEE Transactions on Computers, Vol. C-33, Issue 6
DOI: 10.1109/TC.1984.1676475

OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES
journal, June 2011

Duran, Alejandro; Ayguadé, Eduard; Badia, Rosa M.
Parallel Processing Letters, Vol. 21, Issue 02
DOI: 10.1142/S0129626411000151

Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra
conference, May 2016

Wu, Panruo; Guan, Qiang; DeBardeleben, Nathan
HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing
DOI: 10.1145/2907294.2907315

SWIFT: Software Implemented Fault Tolerance
conference, January 2005

Reis, G. A.; Chang, J.; Vachharajani, N.
International Symposium on Code Generation and Optimization
DOI: 10.1109/CGO.2005.34

Interpolation-Restart Strategies for Resilient Eigensolvers
journal, January 2016

Agullo, E.; Giraud, L.; Salas, P.
SIAM Journal on Scientific Computing, Vol. 38, Issue 5
DOI: 10.1137/15M1042115

Parallel asynchronous algorithms: A survey
journal, November 2020

Spiteri, Pierre
Advances in Engineering Software, Vol. 149
DOI: 10.1016/j.advengsoft.2020.102896

Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications
conference, May 2017

Subasi, Omer; Yalcin, Gulay; Zyulkyarov, Ferad
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
DOI: 10.1109/CCGRID.2017.40

Optimal Resilience Patterns to Cope with Fail-Stop and Silent Errors
conference, May 2016

Benoit, Anne; Cavelan, Aurelien; Robert, Yves
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/IPDPS.2016.39

A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers
journal, July 2018

Cantwell, Chris D.; Nielsen, Allan S.
Journal of Scientific Computing, Vol. 78, Issue 1
DOI: 10.1007/s10915-018-0778-7

Towards Textbook Efficiency for Parallel Multigrid
journal, February 2015

Gmeiner, Björn; Rüde, Ulrich; Stengel, Holger
Numerical Mathematics: Theory, Methods and Applications, Vol. 8, Issue 1
DOI: 10.4208/nmtma.2015.w10si

Symmetric active/active metadata service for high availability parallel file systems
journal, December 2009

He, Xubin; Ou, Li; Engelmann, Christian
Journal of Parallel and Distributed Computing, Vol. 69, Issue 12
DOI: 10.1016/j.jpdc.2009.08.004

Verificarlo: Checking Floating Point Accuracy through Monte Carlo Arithmetic
conference, July 2016

Denis, Christophe; De Oliveira Castro, Pablo; Petit, Eric
2016 IEEE 23rd Symposium on Computer Arithmetic (ARITH)
DOI: 10.1109/ARITH.2016.31

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods
conference, January 2016

Tao, Dingwen; Song, Shuaiwen Leon; Krishnamoorthy, Sriram
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing - HPDC '16
DOI: 10.1145/2907294.2907306

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing
journal, November 2015

Göddeke, Dominik; Altenbernd, Mirco; Ribbrock, Dirk
Parallel Computing, Vol. 49
DOI: 10.1016/j.parco.2015.07.003

Parallel adaptive FETI‐DP using lightweight asynchronous dynamic load balancing
journal, October 2019

Klawonn, Axel; Kühn, Martin J.; Rheinbach, Oliver
International Journal for Numerical Methods in Engineering, Vol. 121, Issue 4
DOI: 10.1002/nme.6237

Distributed asynchronous computation of fixed points
journal, September 1983

Bertsekas, Dimitri P.
Mathematical Programming, Vol. 27, Issue 1
DOI: 10.1007/bf02591967

Multivariate Quadrature on Adaptive Sparse Grids
journal, August 2003

Bungartz, H. -J.; Dirnstorfer, S.
Computing, Vol. 71, Issue 1
DOI: 10.1007/s00607-003-0016-4

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

Bosilca, George; Delmas, Rémi; Dongarra, Jack
Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
DOI: 10.1016/j.jpdc.2008.12.002

ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability
journal, January 2012

George, Cijo; Vadhiyar, Sathish S.
Procedia Computer Science, Vol. 9
DOI: 10.1016/j.procs.2012.04.018

The Lanczos and conjugate gradient algorithms in finite precision arithmetic
journal, May 2006

Meurant, Gérard; Strakoš, Zdeněk
Acta Numerica, Vol. 15
DOI: 10.1017/s096249290626001x

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006

Hargrove, Paul H.; Duell, Jason C.
Journal of Physics: Conference Series, Vol. 46
DOI: 10.1088/1742-6596/46/1/067

Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
conference, June 2014

Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
DOI: 10.1109/dsn.2014.78

Node-Failure-Resistant Preconditioned Conjugate Gradient Method without Replacement Nodes
conference, November 2019

Pachajoa, Carlos; Pacher, Christina; Gansterer, Wilfried N.
2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
DOI: 10.1109/ftxs49593.2019.00009

Evaluating the Impact of SDC on the GMRES Iterative Solver
conference, May 2014

Elliott, James; Hoemmen, Mark; Mueller, Frank
2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2014.123

Dynamic load balancing and efficient load estimators for asynchronous iterative algorithms
journal, April 2005

Bahi, J. M.; Contassot-Vivier, S.; Couturier, R.
IEEE Transactions on Parallel and Distributed Systems, Vol. 16, Issue 4
DOI: 10.1109/tpds.2005.45

On Soft Errors in the Conjugate Gradient Method: Sensitivity and Robust Numerical Detection
journal, January 2020

Agullo, Emmanuel; Cools, Siegfried; Yetkin, Emrullah Fatih
SIAM Journal on Scientific Computing, Vol. 42, Issue 6
DOI: 10.1137/18m122858x

Fault Tolerance in the Parareal Method
conference, May 2016

Nielsen, Allan S.; Hesthaven, Jan S.
HPDC'16: The 25th International Symposium on High-Performance Parallel and Distributed Computing, Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale
DOI: 10.1145/2909428.2909431

PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures
conference, January 2017

Kim, Jungwon; Lee, Seyong; Vetter, Jeffrey S.
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
DOI: 10.1145/3126908.3126943

A dimension adaptive sparse grid combination technique for machine learning
journal, April 2007

Garcke, Jochen
ANZIAM Journal, Vol. 48
DOI: 10.21914/anziamj.v48i0.70

A Pattern Language for High-Performance Computing Resilience
text, January 2017

Hukerikar, Saurabh; Engelmann, Christian
arXiv
DOI: 10.48550/arxiv.1710.09074

rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks
preprint, January 2019

Mohammed, Ali; Cavelan, Aurelien; Ciorba, Florina M.
arXiv
DOI: 10.48550/arxiv.1905.08073

Algorithm-Based Fault Tolerance for Parallel Stencil Computations
preprint, January 2019

Cavelan, Aurélien; Ciorba, Florina M.
arXiv
DOI: 10.48550/arxiv.1909.00709

Application-Level Differential Checkpointing for HPC Applications with Dynamic Datasets
conference, May 2019

Scalable, fault tolerant membership for MPI tasks on HPC systems
conference, January 2006

Toward fault-tolerant parallel-in-time integration with PFASST
journal, February 2017

Correcting soft errors online in fast fourier transform
conference, January 2017

A highly scalable, algorithm-based fault-tolerant solver for gyrokinetic plasma simulations
conference, November 2017

A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance
conference, March 2007

The Open Community Runtime: A runtime system for extreme scale computing
conference, September 2016

A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique
conference, July 2015

ADFT: An Adaptive Framework for Fault Tolerance on Large Scale Systems using Application Malleability
journal, January 2012

Algorithm-based fault tolerance for dense matrix factorizations
journal, September 2012

An evaluation of lazy fault detection based on Adaptive Redundant Multithreading
conference, September 2014

Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
conference, November 2015

Investigating the Resilience of Dynamic Loop Scheduling in Heterogeneous Computing Systems
conference, June 2015

CRAFT: A Library for Easier Application-Level Checkpoint/Restart and Automatic Fault Tolerance
journal, March 2019

MCALIB: Measuring Sensitivity to Rounding Error with Monte Carlo Programming
journal, April 2015

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations
conference, May 2019

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

A multirate time stepping strategy for stiff ordinary differential equations
journal, November 2006

A dimension adaptive sparse grid combination technique for machine learning
journal, April 2007

A fault tolerant approach to microprocessor design
conference, January 2001

Characterizing the impact of soft errors on iterative methods in scientific computing
conference, January 2011

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
journal, February 2021

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006

An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL
journal, April 2018

Algorithm-Based Fault Tolerance for Parallel Stencil Computations
conference, September 2019

Methods of conjugate gradients for solving linear systems
journal, December 1952

Comparison between adaptive and uniform discontinuous Galerkin simulations in dry 2D bubble experiments
journal, February 2013

Fully Adaptive Multigrid Methods
journal, February 1993

Tuning stationary iterative solvers for fault resilience
conference, January 2015

A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes
journal, December 2017

A PIN-Based Dynamic Software Fault Injection System
conference, November 2008

Extreme-Scale Block-Structured Adaptive Mesh Refinement
journal, January 2018

A Stencil Scaling Approach for Accelerating Matrix-Free Finite Element Implementations
journal, January 2018

Discrete Stochastic Arithmetic for Validating Results of Numerical Software
journal, December 2004

rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks
conference, July 2019

An efficient parallel implementation of explicit multirate Runge–Kutta schemes for discontinuous Galerkin computations
journal, January 2014

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

Achieving algorithmic resilience for temporal integration through spectral deferred corrections
journal, January 2017

Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices
conference, June 2015

PapyrusKV: a high-performance parallel key-value store for distributed NVM architectures
conference, January 2017

Algorithm-based fault recovery of adaptively refined parallel multilevel grids
journal, August 2017

FTI: high performance fault tolerance interface for hybrid systems
conference, January 2011

On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing
journal, October 2015

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing
conference, March 2018

Multivariate Quadrature on Adaptive Sparse Grids
journal, August 2003

Algorithms and data structures for massively parallel generic adaptive finite element codes
journal, December 2011

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures
conference, August 2019

A self adjusting multirate algorithm for robust time discretization of partial differential equations
journal, April 2020

Reduced Triple Modular redundancy for built-in self-repair in VLIW-processors
conference, September 2007

Fault Tolerance in the Parareal Method
conference, May 2016

VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale
conference, May 2019

Toward Exascale Resilience
journal, September 2009

Multirate linear multistep methods
journal, December 1984

Complex scientific applications made fault-tolerant with the sparse grid combination technique
journal, July 2016

FlipBack: Automatic Targeted Protection against Silent Data Corruption
conference, November 2016

Robust distributed orthogonalization based on randomized aggregation
conference, January 2011

Resilience for Massively Parallel Multigrid Solvers
journal, January 2016

Parallel adaptive FETI‐DP using lightweight asynchronous dynamic load balancing
journal, October 2019

Fault tolerant communication-optimal 2.5D matrix multiplication
journal, June 2017

On asynchronous iterations
journal, November 2000

Programming Models and Development Software for a Space-Based Many-Core Processor
conference, August 2011

Soft fault detection and correction for multigrid
journal, February 2017

Evaluating Support for OpenMP Offload Features
conference, January 2018

Anisotropic mesh adaptivity for multi-scale ocean modelling
journal, November 2009

Fine-Grained Parallel Incomplete LU Factorization
journal, January 2015

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
journal, January 2008

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
conference, August 2016

ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner
conference, May 2016

Resilient gossip-inspired all-reduce algorithms for high-performance computing: Potential, limitations, and open questions
journal, April 2018

Unified fault-tolerance framework for hybrid task-parallel message-passing applications
journal, September 2016

A method of finite element tearing and interconnecting and its parallel solution algorithm
journal, October 1991

REFINE: realistic fault injection via compiler-based instrumentation for accuracy, portability and speed
conference, November 2017

Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
conference, May 2014

Discrete A Priori Bounds for the Detection of Corrupted PDE Solutions in Exascale Computations
journal, January 2017

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods
conference, November 2018

Numerical recovery strategies for parallel resilient Krylov linear solvers
journal, August 2016

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems
conference, October 2013

Debugging and Optimization of HPC Programs with the Verrou Tool
conference, November 2019