This work is based on the seminar ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing achieves very high computation speeds at the cost of an enormous amount of resources. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help to understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve system- or application-level checkpointing and rollback strategies for the case in which an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
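The energy and operation counts quoted in the abstract can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, the electricity price of 0.10 Euro per kWh is an assumption implied by the quoted totals rather than a figure from the article, and the sustained rate of 10^18 floating-point operations per second is assumed as the nominal meaning of "exascale".

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# ASSUMPTIONS (not stated in the source): a price of 0.10 Euro per kWh and a
# sustained rate of 1e18 floating-point operations per second.

SECONDS_PER_HOUR = 3600


def energy_kwh(power_mw, hours):
    """Energy in kWh drawn at a constant power (given in MW) for `hours`."""
    return power_mw * 1000 * hours  # 1 MW = 1000 kW


def energy_cost_eur(kwh, price_eur_per_kwh):
    """Energy cost in Euro for `kwh` at a flat price per kWh."""
    return kwh * price_eur_per_kwh


def total_flops(flops_per_second, hours):
    """Operations executed at a constant sustained rate for `hours`."""
    return flops_per_second * hours * SECONDS_PER_HOUR


if __name__ == "__main__":
    kwh = energy_kwh(20, 48)              # 20 MW for 48 h
    cost = energy_cost_eur(kwh, 0.10)     # assumed price
    flops = total_flops(1e18, 48)         # assumed exascale rate
    print(f"energy: {kwh:,} kWh")
    print(f"cost:   {cost:,.0f} Euro")
    print(f"work:   {flops:.3e} floating-point operations")
```

Under these assumptions the run draws 960,000 kWh, costs roughly 96,000 Euro, and executes about 1.7 × 10^23 operations, which matches the rounded figures in the abstract.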
Agullo, Emmanuel, et al. "Resiliency in numerical algorithm design for extreme scale simulations." International Journal of High Performance Computing Applications, vol. 36, no. 2, Dec. 2021. https://doi.org/10.1177/10943420211055188
Agullo, Emmanuel, Altenbernd, Mirco, Anzt, Hartwig, et al., "Resiliency in numerical algorithm design for extreme scale simulations," International Journal of High Performance Computing Applications 36, no. 2 (2021), https://doi.org/10.1177/10943420211055188
@article{osti_1855669,
author = {Agullo, Emmanuel and Altenbernd, Mirco and Anzt, Hartwig and Bautista-Gomez, Leonardo and Benacchio, Tommaso and Bonaventura, Luca and Bungartz, Hans-Joachim and Chatterjee, Sanjay and Ciorba, Florina M. and DeBardeleben, Nathan and others},
title = {Resiliency in numerical algorithm design for extreme scale simulations},
doi = {10.1177/10943420211055188},
url = {https://www.osti.gov/biblio/1855669},
journal = {International Journal of High Performance Computing Applications},
issn = {1094-3420},
number = {2},
volume = {36},
place = {United States},
publisher = {SAGE},
year = {2021},
month = {12}}
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States); Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States); Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1855669
Journal Information:
International Journal of High Performance Computing Applications, Vol. 36, Issue 2; ISSN 1094-3420