DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

Authors:
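Chakraborty, Sourav [1]; Laguna, Ignacio [2]; Emani, Murali [2]; Mohror, Kathryn [2]; Panda, Dhabaleswar K. [1]; Schulz, Martin [3]; Subramoni, Hari [1]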
  1. The Ohio State University, Columbus, Ohio
  2. Lawrence Livermore National Laboratory, Livermore, California
  3. Technische Universität München, Munich, Germany
Publication Date:
Sponsoring Org.:
USDOE
OSTI Identifier:
1464593
Grant/Contract Number:  
DEAC52-07NA27344; LLNL-CONF-706037
Resource Type:
Publisher's Accepted Manuscript
Journal Name:
Concurrency and Computation. Practice and Experience
Additional Journal Information:
Journal Name: Concurrency and Computation. Practice and Experience; Journal Volume: 32; Journal Issue: 3; Journal ID: ISSN 1532-0626
Publisher:
Wiley Blackwell (John Wiley & Sons)
Country of Publication:
United Kingdom
Language:
English

Citation Formats

Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications. United Kingdom: N. p., 2018. Web. doi:10.1002/cpe.4863.
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, & Subramoni, Hari. EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications. United Kingdom. doi:10.1002/cpe.4863.
Chakraborty, Sourav, Laguna, Ignacio, Emani, Murali, Mohror, Kathryn, Panda, Dhabaleswar K., Schulz, Martin, and Subramoni, Hari. Tue . "EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications". United Kingdom. doi:10.1002/cpe.4863.
@article{osti_1464593,
title = {EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications},
author = {Chakraborty, Sourav and Laguna, Ignacio and Emani, Murali and Mohror, Kathryn and Panda, Dhabaleswar K. and Schulz, Martin and Subramoni, Hari},
abstractNote = {},
doi = {10.1002/cpe.4863},
journal = {Concurrency and Computation. Practice and Experience},
number = {3},
volume = {32},
place = {United Kingdom},
year = {2018},
month = {8}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record
DOI: 10.1002/cpe.4863

Citation Metrics:
Cited by: 2 works
Citation information provided by
Web of Science

Works referenced in this record:

Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand
conference, September 2009

  • Gangadharappa, Tejus; Koop, Matthew; Panda, Dhabaleswar K.
  • 2009 International Conference on Parallel Processing Workshops (ICPPW)
  • DOI: 10.1109/ICPPW.2009.77

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters
conference, May 2015

  • Chandrasekar, Raghunath Raja; Venkatesh, Akshay; Hamidouche, Khaled
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.169

Algorithm-Based Fault Tolerance for Matrix Operations
journal, June 1984

  • Huang, Kuang-Hua; Abraham, Jacob A.
  • IEEE Transactions on Computers, Vol. C-33, Issue 6
  • DOI: 10.1109/TC.1984.1676475

An analysis of algorithm-based fault tolerance techniques
journal, April 1988


Toward resilient algorithms and applications
conference, January 2013

  • Heroux, Michael A.
  • Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale - FTXS '13
  • DOI: 10.1145/2465813.2465814

FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery
conference, May 2014

  • Sato, Kento; Moody, Adam; Mohror, Kathryn
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
  • DOI: 10.1109/IPDPS.2014.126

Simplifying the Recovery Model of User-Level Failure Mitigation
conference, November 2014

  • Bland, Wesley; Raffenetti, Kenneth; Balaji, Pavan
  • 2014 Workshop on Exascale MPI at Supercomputing Conference (ExaMPI)
  • DOI: 10.1109/ExaMPI.2014.4

The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
conference, March 2007

  • Hursey, Joshua; Squyres, Jeffrey M.; Mattox, Timothy I.
  • 2007 IEEE International Parallel and Distributed Processing Symposium
  • DOI: 10.1109/IPDPS.2007.370605

Replication-Based Fault Tolerance for MPI Applications
journal, July 2009

  • Walters, J. P.; Chaudhary, V.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 20, Issue 7
  • DOI: 10.1109/TPDS.2008.172

Local recovery and failure masking for stencil-based applications at extreme scales
conference, January 2015

  • Gamell, Marc; Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - SC '15
  • DOI: 10.1145/2807591.2807672

Recovery in distributed systems using asynchronous message logging and checkpointing
conference, January 1988

  • Johnson, David B.; Zwaenepoel, Willy
  • Proceedings of the seventh annual ACM Symposium on Principles of distributed computing - PODC '88
  • DOI: 10.1145/62546.62575

Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
conference, November 2014

  • Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2014.78

Interconnect agnostic checkpoint/restart in open MPI
conference, January 2009

  • Hursey, Joshua; Mattox, Timothy I.; Lumsdaine, Andrew
  • Proceedings of the 18th ACM international symposium on High performance distributed computing - HPDC '09
  • DOI: 10.1145/1551609.1551619

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
journal, February 2013

  • Egwutuoha, Ifeanyi P.; Levy, David; Selic, Bran
  • The Journal of Supercomputing, Vol. 65, Issue 3
  • DOI: 10.1007/s11227-013-0884-0

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
conference, January 2013

  • Li, Dong; Chen, Zizhong; Wu, Panruo
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - SC '13
  • DOI: 10.1145/2503210.2503226

A bridging model for parallel computation
journal, August 1990


Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
conference, January 2014

  • Teranishi, Keita; Heroux, Michael A.
  • Proceedings of the 21st European MPI Users' Group Meeting - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642774

Distributed snapshots: determining global states of distributed systems
journal, February 1985

  • Chandy, K. Mani; Lamport, Leslie
  • ACM Transactions on Computer Systems, Vol. 3, Issue 1
  • DOI: 10.1145/214451.214456

Algorithm-based fault tolerance applied to high performance computing
journal, April 2009

  • Bosilca, George; Delmas, Rémi; Dongarra, Jack
  • Journal of Parallel and Distributed Computing, Vol. 69, Issue 4
  • DOI: 10.1016/j.jpdc.2008.12.002

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
conference, November 2010

  • Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • DOI: 10.1109/SC.2010.18

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application
conference, May 2013

  • Karlin, Ian; Bhatele, Abhinav; Keasler, Jeff
  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
  • DOI: 10.1109/IPDPS.2013.115

PMI Extensions for Scalable MPI Startup
conference, January 2014

  • Chakraborty, S.; Subramoni, H.; Perkins, J.
  • Proceedings of the 21st European MPI Users' Group Meeting - EuroMPI/ASIA '14
  • DOI: 10.1145/2642769.2642780

A survey of rollback-recovery protocols in message-passing systems
journal, September 2002

  • Elnozahy, E. N. (Mootaz); Alvisi, Lorenzo; Wang, Yi-Min
  • ACM Computing Surveys, Vol. 34, Issue 3
  • DOI: 10.1145/568522.568525

Evaluating and extending user-level fault tolerance in MPI applications
journal, July 2016

  • Laguna, Ignacio; Richards, David F.; Gamblin, Todd
  • The International Journal of High Performance Computing Applications, Vol. 30, Issue 3
  • DOI: 10.1177/1094342015623623

An evaluation of User-Level Failure Mitigation support in MPI
journal, May 2013


Non-Blocking PMI Extensions for Fast MPI Startup
conference, May 2015

  • Chakraborty, Sourav; Subramoni, Hari; Moody, Adam
  • 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
  • DOI: 10.1109/CCGrid.2015.151

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
conference, January 2002

  • Bosilca, G.; Bouteiller, A.; Cappello, F.
  • ACM/IEEE SC 2002 Conference (SC'02)
  • DOI: 10.1109/SC.2002.10048

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
conference, November 2006

  • Coti, Camille; Herault, Thomas; Lemarinier, Pierre
  • ACM/IEEE SC 2006 Conference (SC'06)
  • DOI: 10.1109/SC.2006.15

Evaluating the viability of process replication reliability for exascale systems
conference, January 2011

  • Ferreira, Kurt; Stearley, Jon; Laros, James H.
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
  • DOI: 10.1145/2063384.2063443

Berkeley lab checkpoint/restart (BLCR) for Linux clusters
journal, September 2006


The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
journal, November 2005

  • Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian
  • The International Journal of High Performance Computing Applications, Vol. 19, Issue 4
  • DOI: 10.1177/1094342005056139

Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
conference, June 2014

  • Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
  • 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
  • DOI: 10.1109/DSN.2014.78

Low-latency, concurrent checkpointing for parallel programs
journal, January 1994

  • Li, Kai; Naughton, J. F.; Plank, J. S.
  • IEEE Transactions on Parallel and Distributed Systems, Vol. 5, Issue 8
  • DOI: 10.1109/71.298215

Algorithm-based fault tolerance on a hypercube multiprocessor
journal, January 1990

  • Banerjee, P.; Rahmeh, J. T.; Stunkel, C.
  • IEEE Transactions on Computers, Vol. 39, Issue 9
  • DOI: 10.1109/12.57055

HARNESS and fault tolerant MPI
journal, October 2001


Toward Exascale Resilience
journal, September 2009

  • Cappello, Franck; Geist, Al; Gropp, Bill
  • The International Journal of High Performance Computing Applications, Vol. 23, Issue 4
  • DOI: 10.1177/1094342009347767

Failure Detection and Propagation in HPC systems
conference, November 2016

  • Bosilca, George; Bouteiller, Aurelien; Guermouche, Amina
  • SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
  • DOI: 10.1109/SC.2016.26

Coordinated checkpoint versus message log for fault tolerant MPI
journal, January 2004

  • Lemarinier, Pierre; Bouteiller, Aurelien; Krawezik, Geraud
  • International Journal of High Performance Computing and Networking, Vol. 2, Issue 2/3/4
  • DOI: 10.1504/IJHPCN.2004.008899

PNMPI tools: a whole lot greater than the sum of their parts
conference, January 2007

  • Schulz, Martin; de Supinski, Bronis R.
  • Proceedings of the 2007 ACM/IEEE conference on Supercomputing - SC '07
  • DOI: 10.1145/1362622.1362663