DOE PAGES: U.S. Department of Energy, Office of Scientific and Technical Information

Title: EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications

Abstract

Scientists from many different fields develop bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to reduce failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been demonstrated; and they require the MPI profiling interface, which precludes the use of other tools that rely on that interface. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI alone. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.

Authors:
 Chakraborty, Sourav [1]; Laguna, Ignacio [2]; Emani, Murali [2]; Mohror, Kathryn [2]; Panda, Dhabaleswar K. [1]; Schulz, Martin [3]; Subramoni, Hari [1]
  1. The Ohio State Univ., Columbus, OH (United States)
  2. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
  3. Technical Univ. of Munich (Germany)
Publication Date:
August 14, 2018
Research Org.:
Lawrence Livermore National Laboratory (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); National Science Foundation (NSF)
OSTI Identifier:
1708993
Alternate Identifier(s):
OSTI ID: 1464593
Report Number(s):
LLNL-JRNL-706037
Journal ID: ISSN 1532-0626; 841204
Grant/Contract Number:  
AC52-07NA27344; CCF-1565414; CNS-1419123; CNS-1513120; ACI-1450440; IIS-1447804
Resource Type:
Accepted Manuscript
Journal Name:
Concurrency and Computation: Practice and Experience
Additional Journal Information:
Journal Volume: 32; Journal Issue: 3; Journal ID: ISSN 1532-0626
Publisher:
Wiley
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; fault tolerance; high-performance computing; MPI; resilience

Citation Formats

Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. United States: N. p., 2018. Web. doi:10.1002/cpe.4863.
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, & Subramoni, Hari. EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications. United States. https://doi.org/10.1002/cpe.4863
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. Tue Aug 14, 2018. "EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications". United States. https://doi.org/10.1002/cpe.4863. https://www.osti.gov/servlets/purl/1708993.
@article{osti_1708993,
title = {EReinit: Scalable and efficient fault-tolerance for bulk-synchronous MPI applications},
author = {Chakraborty, Sourav and Laguna, Ignacio and Emani, Murali and Mohror, Kathryn and Panda, Dhabaleswar K. and Schulz, Martin and Subramoni, Hari},
abstractNote = {Scientists from many different fields develop bulk-synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger-scale future HPC systems, providing efficient fault-tolerance mechanisms for this class of applications is paramount. The global-restart model has been proposed to reduce failure-recovery time in bulk-synchronous applications by allowing a fast reinitialization of MPI. However, current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been demonstrated; and they require the MPI profiling interface, which precludes the use of other tools that rely on that interface. Here, we present EReinit, an implementation of the global-restart model that addresses these problems. Our key idea and optimization is the co-design of basic fault-tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI alone. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.},
doi = {10.1002/cpe.4863},
journal = {Concurrency and Computation: Practice and Experience},
number = 3,
volume = 32,
place = {United States},
year = {2018},
month = {aug}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

Citation Metrics:
Cited by: 13 works
Citation information provided by
Web of Science

Works referenced in this record:

An analysis of algorithm-based fault tolerance techniques
journal, April 1988


Toward resilient algorithms and applications
conference, January 2013

  • Heroux, Michael A.
  • Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale - FTXS '13
  • DOI: 10.1145/2465813.2465814

The Open Run-Time Environment (OpenRTE): A transparent multicluster environment for high-performance computing
journal, February 2008


Local recovery and failure masking for stencil-based applications at extreme scales
conference, January 2015

  • Gamell, Marc; Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15
  • DOI: 10.1145/2807591.2807672

A bridging model for parallel computation
journal, August 1990


Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
conference, January 2014

  • Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642774

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

An evaluation of User-Level Failure Mitigation support in MPI
journal, May 2013


Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

  • Laguna, Ignacio; Richards, David F.; Gamblin, Todd
  • The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
  • DOI: 10.1177/1094342015623623

Evaluating the viability of process replication reliability for exascale systems
conference, January 2011

  • Ferreira, Kurt; Stearley, Jon; Laros, James H.
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11
  • DOI: 10.1145/2063384.2063443

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006


The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
journal, November 2005

  • Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian
  • The International Journal of High Performance Computing Applications, Vol. 19, Issue 4
  • DOI: 10.1177/1094342005056139

SLURM: Simple Linux Utility for Resource Management
book, January 2003

  • Yoo, Andy B.; Jette, Morris A.; Grondona, Mark
  • Job Scheduling Strategies for Parallel Processing
  • DOI: 10.1007/10968987_3

Low-latency, concurrent checkpointing for parallel programs
journal, January 1994

  • Li, Kai; Naughton, J. F.; Plank, J. S.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 5, Issue 8
  • DOI: 10.1109/71.298215

Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
book, January 2004

  • Gabriel, Edgar; Fagg, Graham E.; Bosilca, George
  • Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • DOI: 10.1007/978-3-540-30218-6_19

PNMPI tools: a whole lot greater than the sum of their parts
conference, January 2007

  • Schulz, Martin; de Supinski, Bronis R.
  • Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
  • DOI: 10.1145/1362622.1362663

Recovery in distributed systems using asynchronous message logging and checkpointing
conference, January 1988

  • Johnson, David B.; Zwaenepoel, Willy
  • Proceedings of the seventh annual ACM Symposium on Principles of distributed computing - PODC '88
  • DOI: 10.1145/62546.62575

Interconnect agnostic checkpoint/restart in open MPI
conference, January 2009

  • Hursey, Joshua; Mattox, Timothy I.; Lumsdaine, Andrew
  • Proceedings of the 18th ACM international symposium on High performance distributed computing - HPDC '09
  • DOI: 10.1145/1551609.1551619

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
journal, February 2013

  • Egwutuoha, Ifeanyi P.; Levy, David; Selic, Bran
  • The Journal of Supercomputing, Vol. 65, Issue 3
  • DOI: 10.1007/s11227-013-0884-0

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
conference, January 2013

  • Li, Dong; Chen, Zizhong; Wu, Panruo
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '13
  • DOI: 10.1145/2503210.2503226

Distributed snapshots: determining global states of distributed systems
journal, February 1985

  • Chandy, K. Mani; Lamport, Leslie
  • ACM Transactions on Computer Systems, Vol. 3, Issue 1
  • DOI: 10.1145/214451.214456

PMI Extensions for Scalable MPI Startup
conference, January 2014

  • Chakraborty, S.; Subramoni, H.; Perkins, J.
  • Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642780

A survey of rollback-recovery protocols in message-passing systems
journal, September 2002

  • Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min
  • ACM Computing Surveys, Vol. 34, Issue 3
  • DOI: 10.1145/568522.568525

FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
book, January 2000

  • Fagg, Graham E.; Dongarra, Jack J.
  • Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • DOI: 10.1007/3-540-45255-9_47

Algorithm-based fault tolerance on a hypercube multiprocessor
journal, January 1990

  • Banerjee, P.; Rahmeh, J. T.; Stunkel, C.
  • IEEE Transactions on Computers, Vol. 39, Issue 9
  • DOI: 10.1109/12.57055

Toward Exascale Resilience
journal, September 2009

  • Cappello, Franck; Geist, Al; Gropp, Bill
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
  • DOI: 10.1177/1094342009347767

Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
conference, June 2014

  • Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • DOI: 10.1109/DSN.2014.78

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

  • Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.78

FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery
conference, May 2014

  • Sato, Kento; Moody, Adam; Mohror, Kathryn
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.126

HARNESS and fault tolerant MPI
journal, October 2001


Simplifying the Recovery Model of User-Level Failure Mitigation
conference, November 2014

  • Bland, Wesley; Raffenetti, Kenneth; Balaji, Pavan
  • 2014 Workshop on Exascale MPI at Supercomputing Conference (ExaMPI)
  • DOI: 10.1109/ExaMPI.2014.4

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application
conference, May 2013

  • Karlin, Ian; Bhatele, Abhinav; Keasler, Jeff
  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2013.115

Failure Detection and Propagation in HPC systems
conference, November 2016

  • Bosilca, George; Bouteiller, Aurelien; Guermouche, Amina
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2016.26

Non-Blocking PMI Extensions for Fast MPI Startup
conference, May 2015

  • Chakraborty, Sourav; Subramoni, Hari; Moody, Adam
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.151

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2010.18

Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand
conference, September 2009

  • Gangadharappa, Tejus; Koop, Matthew; Panda, Dhabaleswar K.
  • 2009 International Conference on Parallel Processing Workshops (ICPPW)
  • DOI: 10.1109/ICPPW.2009.77

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
conference, March 2007

  • Hursey, Joshua; Squyres, Jeffrey M.; Mattox, Timothy I.
  • 2007 IEEE International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2007.370605

Replication-Based Fault Tolerance for MPI Applications
journal, July 2009

  • Walters, J. P.; Chaudhary, V.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 20, Issue 7
  • DOI: 10.1109/TPDS.2008.172

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters
conference, May 2015

  • Chandrasekar, Raghunath Raja; Venkatesh, Akshay; Hamidouche, Khaled
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.169

Coordinated checkpoint versus message log for fault tolerant MPI
journal, January 2004

  • Lemarinier, Pierre; Bouteiller, Aurelien; Krawezik, Geraud
  • International Journal of High Performance Computing and Networking, Vol. 2, Issue 2/3/4
  • DOI: 10.1504/IJHPCN.2004.008899

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
conference, November 2006

  • Coti, Camille; Herault, Thomas; Lemarinier, Pierre
  • ACM/IEEE SC 2006 Conference (SC'06)
  • DOI: 10.1109/SC.2006.15

Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984

  • Huang, Kuang-Hua; Abraham, Jacob A.
  • IEEE Transactions on Computers, Vol. C-33, Issue 6
  • DOI: 10.1109/TC.1984.1676475

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
conference, January 2002

  • Bosilca, G.; Bouteiller, A.; Cappello, F.
  • ACM/IEEE SC 2002 Conference (SC'02)
  • DOI: 10.1109/SC.2002.10048

An Analysis Of Algorithm-Based Fault Tolerance Techniques
conference, April 1986

  • Luk, Franklin T.; Park, Haesun
  • 30th Annual Technical Symposium, SPIE Proceedings
  • DOI: 10.1117/12.936896

Toward Resilient Algorithms and Applications
preprint, January 2014


Works referencing / citing this record:

Foreword to the Special Issue of the Workshop on Exascale MPI (ExaMPI 2017)
journal, July 2019

  • Skjellum, Anthony; Bangalore, Purushotham V.; Grant, Ryan E.
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 3
  • DOI: 10.1002/cpe.5459

Application health monitoring for extreme‐scale resiliency using cooperative fault management
journal, July 2019

  • Agarwal, Pratul K.; Naughton, Thomas; Park, Byung H.
  • Concurrency and Computation: Practice and Experience, Vol. 32, Issue 2
  • DOI: 10.1002/cpe.5449