Title: Application health monitoring for extreme-scale resiliency using cooperative fault management

Abstract

Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. In this work, we introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns as an indicator of an application's health, on the understanding that a violation of these patterns could indicate a fault. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. Lastly, the developed approach is general and can be easily applied to other applications.
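
The detection-and-corroboration idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation and does not use the Fault Tolerance Backplane; the function names (`pattern_violated`, `coordinate`), the standard-deviation rule, the `tolerance` value, the toy energy trace, and the three action labels are all assumptions made for illustration. It only shows how an application-level pattern check and a system-level alert might be combined into a decision at each step.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Decision:
    step: int
    action: str  # "continue", "watch", or "rollback" (hypothetical labels)


def pattern_violated(history, value, tolerance=5.0):
    """Return True if `value` deviates from the pattern seen so far.

    Hypothetical rule: flag a value lying more than `tolerance`
    standard deviations from the mean of the trusted history.
    """
    if len(history) < 3:
        return False  # too few observations to define a pattern
    mu = mean(history)
    sigma = pstdev(history) or 1e-12  # avoid a zero-width threshold
    return abs(value - mu) > tolerance * sigma


def coordinate(outputs, system_alerts, tolerance=5.0):
    """Combine application-level and system-level evidence per step."""
    history, decisions = [], []
    for step, value in enumerate(outputs):
        suspicious = pattern_violated(history, value, tolerance)
        corroborated = step in system_alerts
        if suspicious and corroborated:
            action = "rollback"   # both sources agree on a fault
        elif suspicious or corroborated:
            action = "watch"      # only one source reports a problem
        else:
            action = "continue"
        if not suspicious:
            history.append(value)  # only trusted values extend the pattern
        decisions.append(Decision(step, action))
    return decisions


# Toy total-energy trace with one corrupted value at step 5 (made-up data).
energies = [-100.0, -100.1, -99.9, -100.0, -100.1, -42.0, -100.0]
alerts = {5}  # steps at which a simulated system monitor reported a fault
decisions = coordinate(energies, alerts)
```

In this sketch a rollback is chosen only when the application-level check and the system monitor agree; a single uncorroborated signal merely raises the monitoring level, which mirrors the corroboration step the abstract describes.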

Authors:
Agarwal, Pratul K. [1]; Naughton, III, Thomas [2]; Park, Byung H. [2]; Bernholdt, David E. [2]; Hursey, Joshua J. [3]; Geist, II, Al [2]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); IBM Systems, IBM, Rochester, MN (United States)
Publication Date:
July 2019
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1558573
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal Volume: 32; Journal Issue: 2; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; exascale resiliency; fault tolerance; heterogeneous systems; molecular dynamics; quantum chemistry calculations; silent errors

Citation Formats

Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., and Geist, II, Al. Application health monitoring for extreme-scale resiliency using cooperative fault management. United States: N. p., 2019. Web. doi:10.1002/cpe.5449.
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., & Geist, II, Al. Application health monitoring for extreme-scale resiliency using cooperative fault management. United States. doi:https://doi.org/10.1002/cpe.5449
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., and Geist, II, Al. 2019. "Application health monitoring for extreme-scale resiliency using cooperative fault management". United States. doi:https://doi.org/10.1002/cpe.5449. https://www.osti.gov/servlets/purl/1558573.
@article{osti_1558573,
title = {Application health monitoring for extreme-scale resiliency using cooperative fault management},
author = {Agarwal, Pratul K. and Naughton, III, Thomas and Park, Byung H. and Bernholdt, David E. and Hursey, Joshua J. and Geist, II, Al},
abstractNote = {Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. In this work, we introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns, as indicators of an application's health and knowledge that violation of these patterns could be indication of faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics, and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. Lastly, the developed approach is general and can be easily applied to other applications.},
doi = {10.1002/cpe.5449},
journal = {Concurrency and Computation. Practice and Experience},
number = 2,
volume = 32,
place = {United States},
year = {2019},
month = {7}
}
