DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

This content will become publicly available on July 25, 2020

Title: Application health monitoring for extreme-scale resiliency using cooperative fault management

Abstract

Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications that are oblivious to transient soft and hard errors, and incapable of handling them, could waste supercomputing resources or, worse, yield misleading scientific insights. In this work, we introduce a novel application-driven silent error detection and recovery strategy based on application health monitoring. Our methodology treats application output that follows known patterns as an indicator of the application's health, and treats violations of these patterns as possible indications of faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. Lastly, the developed approach is general and can be easily applied to other applications.
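
The detection and coordination loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the Fault Tolerance Backplane API. Every name here (`HealthReport`, `check_pattern`, `coordinate`) is hypothetical, and the monitored pattern (an output quantity such as total energy staying within a relative tolerance of a baseline) and the three actions are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class HealthReport:
    """One health observation derived from application output (hypothetical type)."""
    step: int
    value: float
    healthy: bool


def check_pattern(values: Iterable[float], baseline: float, rel_tol: float) -> List[HealthReport]:
    """Mark each output value healthy if it stays within rel_tol of the baseline.

    Assumption: the 'known pattern' is a quantity that should stay nearly constant.
    """
    reports = []
    for step, value in enumerate(values):
        drift = abs(value - baseline) / abs(baseline)
        reports.append(HealthReport(step, value, drift <= rel_tol))
    return reports


def coordinate(reports: List[HealthReport], system_fault_steps: Set[int]) -> List[str]:
    """Hypothetical coordinator policy combining application and system evidence.

    - healthy output                            -> "continue"
    - violation corroborated by system monitor  -> "rollback" (restart from a checkpoint)
    - violation without corroboration           -> "recheck" (keep watching before acting)
    """
    actions = []
    for report in reports:
        if report.healthy:
            actions.append("continue")
        elif report.step in system_fault_steps:
            actions.append("rollback")
        else:
            actions.append("recheck")
    return actions


# Made-up output trace: step 3 deviates sharply from the baseline.
energies = [-100.0, -100.01, -100.02, -92.5, -100.01]
reports = check_pattern(energies, baseline=-100.0, rel_tol=0.001)
# Assume a system monitor independently reported a fault at step 3.
actions = coordinate(reports, system_fault_steps={3})
print(actions)
```

In this sketch the coordinator acts only when the application-level violation and the system-level report agree, which mirrors the corroboration step the abstract describes. A real deployment would exchange these events over a communication channel rather than through in-process function calls.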

Authors:
Agarwal, Pratul K. [1]; Naughton, III, Thomas [2]; Park, Byung H. [2]; Bernholdt, David E. [2]; Hursey, Joshua J. [3]; Geist, II, Al [2]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Univ. of Tennessee, Knoxville, TN (United States)
  2. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); IBM Systems, IBM, Rochester, MN (United States)
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
1558573
Grant/Contract Number:
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; exascale resiliency; fault tolerance; heterogeneous systems; molecular dynamics; quantum chemistry calculations; silent errors

Citation Formats

Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., and Geist, II, Al. Application health monitoring for extreme-scale resiliency using cooperative fault management. United States: N. p., 2019. Web. doi:10.1002/cpe.5449.
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., & Geist, II, Al. Application health monitoring for extreme-scale resiliency using cooperative fault management. United States. doi:10.1002/cpe.5449.
Agarwal, Pratul K., Naughton, III, Thomas, Park, Byung H., Bernholdt, David E., Hursey, Joshua J., and Geist, II, Al. "Application health monitoring for extreme-scale resiliency using cooperative fault management". United States. doi:10.1002/cpe.5449.
@article{osti_1558573,
title = {Application health monitoring for extreme-scale resiliency using cooperative fault management},
author = {Agarwal, Pratul K. and Naughton, III, Thomas and Park, Byung H. and Bernholdt, David E. and Hursey, Joshua J. and Geist, II, Al},
doi = {10.1002/cpe.5449},
journal = {Concurrency and Computation. Practice and Experience},
place = {United States},
year = {2019},
month = {7}
}
