Application health monitoring for extreme-scale resiliency using cooperative fault management
Abstract
Resiliency is, and will remain, a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications that are oblivious to, and incapable of handling, transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. In this work, we introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns as an indicator of an application's health, together with the knowledge that a violation of these patterns could indicate a fault. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. Lastly, the developed approach is general and can be easily applied to other applications.
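The detection-and-corroboration idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`check_pattern`, `coordinate`), the standard-deviation threshold, and the action labels are all hypothetical, chosen only to show how an output quantity with a known pattern (for example, a nearly conserved energy in a molecular dynamics run) might be checked and then combined with a system-monitor report before a steering decision is made.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Sequence


@dataclass
class HealthReport:
    step: int
    value: float
    deviates: bool  # True when the value breaks the expected pattern


def check_pattern(history: Sequence[float], value: float, step: int,
                  tolerance: float = 4.0) -> HealthReport:
    """Flag `value` if it lies more than `tolerance` standard deviations
    from the mean of recent healthy output (assumed pattern: near-constant)."""
    mu = mean(history)
    sigma = pstdev(history)
    # Guard against a perfectly flat history, where sigma would be zero.
    threshold = tolerance * max(sigma, 1e-12 * max(abs(mu), 1.0))
    return HealthReport(step, value, abs(value - mu) > threshold)


def coordinate(report: HealthReport, system_fault_reported: bool) -> str:
    """Hypothetical fault-coordinator policy: combine the application's
    health report with a system-monitor signal and pick an action."""
    if not report.deviates:
        return "continue"
    if system_fault_reported:
        # Application symptom corroborated by the system monitors.
        return "rollback_to_checkpoint"
    # Symptom without corroboration: try a cheaper corrective step first.
    return "recompute_step"


# Example: total energies (arbitrary units) from recent healthy steps.
energies = [-1000.0, -1000.2, -999.9, -1000.1]
healthy = check_pattern(energies, -1000.0, step=5)
suspect = check_pattern(energies, -850.0, step=6)
print(coordinate(healthy, False))
print(coordinate(suspect, True))
print(coordinate(suspect, False))
```

In a real deployment the history window, the pattern being tested, and the available actions would depend on the application; the paper's framework exchanges such information over the Fault Tolerance Backplane, which this sketch does not model.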
- Authors:
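- Agarwal, Pratul K.; Naughton, III, Thomas; Park, Byung H.; Bernholdt, David E.; Hursey, Joshua J.; Geist, II, Al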
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); IBM Systems, IBM, Rochester, MN (United States)
- Publication Date:
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
- OSTI Identifier:
- 1558573
- Grant/Contract Number:
- AC05-00OR22725
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Concurrency and Computation. Practice and Experience
- Additional Journal Information:
- Journal Volume: 32; Journal Issue: 2; Journal ID: ISSN 1532-0626
- Publisher:
- Wiley
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; exascale resiliency; fault tolerance; heterogeneous systems; molecular dynamics; quantum chemistry calculations; silent errors
Citation Formats
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., and Geist, II, Al. Application health monitoring for extreme-scale resiliency using cooperative fault management. United States: N. p., 2019.
Web. doi:10.1002/cpe.5449.
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., & Geist, II, Al. Application health monitoring for extreme-scale resiliency using cooperative fault management. United States. doi:https://doi.org/10.1002/cpe.5449
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., and Geist, II, Al. 2019. "Application health monitoring for extreme-scale resiliency using cooperative fault management". United States. doi:https://doi.org/10.1002/cpe.5449. https://www.osti.gov/servlets/purl/1558573.
@article{osti_1558573,
title = {Application health monitoring for extreme-scale resiliency using cooperative fault management},
author = {Agarwal, Pratul K. and Naughton, III, Thomas and Park, Byung H. and Bernholdt, David E. and Hursey, Joshua J. and Geist, II, Al},
abstractNote = {Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. In this work, we introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns, as indicators of an application's health and knowledge that violation of these patterns could be indication of faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics, and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. Lastly, the developed approach is general and can be easily applied to other applications.},
doi = {10.1002/cpe.5449},
journal = {Concurrency and Computation. Practice and Experience},
number = 2,
volume = 32,
place = {United States},
year = {2019},
month = {7}
}
Works referenced in this record:
Leveraging near data processing for high-performance checkpoint/restart
conference, January 2017
- Agrawal, Abhinav; Loh, Gabriel H.; Tuck, James
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
Exploring versioned distributed arrays for resilience in scientific applications: global view resilience
journal, September 2016
- Chien, A.; Balaji, P.; Dun, N.
- The International Journal of High Performance Computing Applications, Vol. 31, Issue 6
What Supercomputers Say: A Study of Five System Logs
conference, June 2007
- Oliner, Adam; Stearley, Jon
- 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
conference, January 2013
- Chen, Zizhong
- Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '13
ABFR: convenient management of latent error resilience using application knowledge
conference, January 2018
- Fang, Aiman; Chien, Andrew A.
- Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '18
CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
conference, September 2009
- Gupta, Rinku; Beckman, Pete; Park, Byung-Hoon
- 2009 International Conference on Parallel Processing (ICPP)
Characterizing the impact of soft errors on iterative methods in scientific computing
conference, January 2011
- Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
- Proceedings of the international conference on Supercomputing - ICS '11
A concise introduction to autonomic computing
journal, July 2005
- Sterritt, Roy; Parashar, Manish; Tianfield, Huaglory
- Advanced Engineering Informatics, Vol. 19, Issue 3
EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications
journal, August 2018
- Chakraborty, Sourav; Laguna, Ignacio; Emani, Murali
- Concurrency and Computation: Practice and Experience
The ganglia distributed monitoring system: design, implementation, and experience
journal, July 2004
- Massie, Matthew L.; Chun, Brent N.; Culler, David E.
- Parallel Computing, Vol. 30, Issue 7
Application Resilience: Making Progress in Spite of Failure
conference, May 2008
- Jones, William M.; Daly, John T.; DeBardeleben, Nathan A.
- 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
conference, February 2015
- Tiwari, Devesh; Gupta, Saurabh; Rogers, James
- 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA)
Failures in large scale systems: long-term measurement, analysis, and implications
conference, January 2017
- Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17
A survey of MPI usage in the US exascale computing project
journal, September 2018
- Bernholdt, David E.; Boehm, Swen; Bosilca, George
- Concurrency and Computation: Practice and Experience
A hybrid fault tolerance scheme for EasyGrid MPI applications
conference, January 2011
- da Silva, Jacques A.; Rebello, Vinod E. F.
- Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science - MGC '11
Resilience for Stencil Computations with Latent Errors
conference, August 2017
- Fang, Aiman; Cavelan, Aurelien; Robert, Yves
- 2017 46th International Conference on Parallel Processing (ICPP)
Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations
conference, November 2010
- Hampton, Scott S.; Alam, Sadaf R.; Crozier, Paul S.
- 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
A survey of rollback-recovery protocols in message-passing systems
journal, September 2002
- Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min
- ACM Computing Surveys, Vol. 34, Issue 3
Analyzing the soft error resilience of linear solvers on multicore multiprocessors
conference, April 2010
- Malkowski, Konrad; Raghavan, Padma; Kandemir, Mahmut
- 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
Performance modeling of microsecond scale biological molecular dynamics simulations on heterogeneous architectures
journal, October 2012
- Agarwal, Pratul K.; Hampton, Scott; Poznanovic, Jeffrey
- Concurrency and Computation: Practice and Experience, Vol. 25, Issue 10
CoMon: a mostly-scalable monitoring system for PlanetLab
journal, January 2006
- Park, KyoungSoo; Pai, Vivek S.
- ACM SIGOPS Operating Systems Review, Vol. 40, Issue 1
DRAM errors in the wild: a large-scale field study
journal, February 2011
- Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich
- Communications of the ACM, Vol. 54, Issue 2
Toward Exascale Resilience
journal, September 2009
- Cappello, Franck; Geist, Al; Gropp, Bill
- The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
Hierarchical error detection in a software implemented fault tolerance (SIFT) environment
journal, January 2000
- Bagchi, S.; Srinivasan, B.; Whisnant, K.
- IEEE Transactions on Knowledge and Data Engineering, Vol. 12, Issue 2
Enhancing application robustness through adaptive fault tolerance
conference, April 2008
- Lan, Zhiling; Li, Yawei; Zheng, Ziming
- 2008 IEEE International Symposium on Parallel and Distributed Processing (IPDPS)
Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo
conference, November 2018
- Levy, Scott; Ferreira, Kurt B.; DeBardeleben, Nathan
- SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
Energy efficient biomolecular simulations with FPGA-based reconfigurable computing
conference, January 2010
- Nallamuthu, Ananth; Smith, Melissa C.; Hampton, Scott
- Proceedings of the 7th ACM international conference on Computing frontiers - CF '10
Understanding failures in petascale computers
journal, July 2007
- Schroeder, Bianca; Gibson, Garth A.
- Journal of Physics: Conference Series, Vol. 78
Detection of Silent Data Corruption in Adaptive Numerical Integration Solvers
conference, September 2017
- Guhur, Pierre-Louis; Constantinescu, Emil; Ghosh, Debojyoti
- 2017 IEEE International Conference on Cluster Computing (CLUSTER)
Recovery Patterns for Iterative Methods in a Parallel Unstable Environment
journal, January 2008
- Langou, J.; Chen, Z.; Bosilca, G.
- SIAM Journal on Scientific Computing, Vol. 30, Issue 1
Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques
conference, August 2016
- Gamell, Marc; Katz, Daniel S.; Teranishi, Keita
- 2016 45th International Conference on Parallel Processing Workshops (ICPPW)
Fast Parallel Algorithms for Short-Range Molecular Dynamics
journal, March 1995
- Plimpton, Steve
- Journal of Computational Physics, Vol. 117, Issue 1
Application Fault Tolerance with Armor Middleware
journal, March 2005
- Kalbarczyk, Z.; Iyer, R. K.
- IEEE Internet Computing, Vol. 9, Issue 2
When is multi-version checkpointing needed?
conference, January 2013
- Lu, Guoming; Zheng, Ziming; Chien, Andrew A.
- Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale - FTXS '13
OVIS: a tool for intelligent, real-time monitoring of computational clusters
conference, January 2006
- Brandt, J. M.; Gentile, A. C.; Hale, D. J.
- Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
Evaluating the Impact of SDC on the GMRES Iterative Solver
conference, May 2014
- Elliott, James; Hoemmen, Mark; Mueller, Frank
- 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
Granularity and the Cost of Error Recovery in Resilient AMR Scientific Applications
conference, November 2016
- Dubey, Anshu; Fujita, Hajime; Graves, Daniel T.
- SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability
conference, January 2007
- Glosli, J. N.; Richards, D. F.; Caspersen, K. J.
- Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984
- Huang, Kuang-Hua; Abraham, Jacob A.
- IEEE Transactions on Computers, Vol. C-33, Issue 6
PLFS: a checkpoint filesystem for parallel applications
conference, January 2009
- Bent, John; Gibson, Garth; Grider, Gary
General atomic and molecular electronic structure system
journal, November 1993
- Schmidt, Michael W.; Baldridge, Kim K.; Boatz, Jerry A.
- Journal of Computational Chemistry, Vol. 14, Issue 11, p. 1347-1363
A survey of high-performance computing scaling challenges
journal, July 2016
- Geist, Al; Reed, Daniel A.
- The International Journal of High Performance Computing Applications, Vol. 31, Issue 1
A Numerical Soft Fault Model for Iterative Linear Solvers
conference, January 2015
- Elliott, James; Hoemmen, Mark; Mueller, Frank
- Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '15
An analysis of latent sector errors in disk drives
conference, January 2007
- Bairavasundaram, Lakshmi N.; Goodson, Garth R.; Pasupathy, Shankar
- Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems - SIGMETRICS '07
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
conference, January 2009
- Dong, Xiangyu; Muralimanohar, Naveen; Jouppi, Norm
- Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09
Provisioning a Multi-tiered Data Staging Area for Extreme-Scale Machines
conference, June 2011
- Prabhakar, Ramya; Vazhkudai, Sudharshan S.; Kim, Youngjae
- 2011 31st International Conference on Distributed Computing Systems (ICDCS)
Exascale fault tolerance challenge and approaches
conference, March 2018
- McNairy, Cameron
- 2018 IEEE International Reliability Physics Symposium (IRPS)
Realization of User Level Fault Tolerant Policy Management through a Holistic Approach for Fault Correlation
conference, June 2011
- Park, Byung H.; Naughton, Thomas J.; Agarwal, Pratul
- 2011 IEEE International Symposium on Policies for Distributed Systems and Networks - POLICY
Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications
journal, October 2016
- Di, Sheng; Cappello, Franck
- IEEE Transactions on Parallel and Distributed Systems, Vol. 27, Issue 10
Fault tolerant preconditioned conjugate gradient for sparse linear system solution
conference, January 2012
- Shantharam, Manu; Srinivasmurthy, Sowmyalatha; Raghavan, Padma
- Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
Soft error vulnerability of iterative linear algebra methods
conference, January 2008
- Bronevetsky, Greg; de Supinski, Bronis
- Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
A large-scale study of soft-errors on GPUs in the field
conference, March 2016
- Nie, Bin; Tiwari, Devesh; Gupta, Saurabh
- 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)
Modeling Input-Dependent Error Propagation in Programs
conference, June 2018
- Li, Guanpeng; Pattabiraman, Karthik
- 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Liquid water: obtaining the right answer for the right reasons
conference, January 2009
- Aprà, Edoardo; Rendell, Alistair P.; Harrison, Robert J.
- Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09
Liquid water: II. Experimental atom pair-correlation functions of liquid D2O
journal, August 1977
- Pálinkás, G.; Kálmán, E.; Kovács, P.
- Molecular Physics, Vol. 34, Issue 2
Understanding Failures in Petascale Computers
text, January 2018
- Schroeder, Bianca; Gibson, Garth
- Figshare
An analysis of latent sector errors in disk drives
journal, June 2007
- Bairavasundaram, Lakshmi N.; Goodson, Garth R.; Pasupathy, Shankar
- ACM SIGMETRICS Performance Evaluation Review, Vol. 35, Issue 1
DRAM errors in the wild: a large-scale field study
conference, January 2009
- Schroeder, Bianca; Pinheiro, Eduardo; Weber, Wolf-Dietrich
- Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems - SIGMETRICS '09