An analysis of algorithm-based fault tolerance techniques
|
journal
|
April 1988 |
Toward resilient algorithms and applications
|
conference
|
January 2013 |
The Open Run-Time Environment (OpenRTE): A transparent multicluster environment for high-performance computing
|
journal
|
February 2008 |
Local recovery and failure masking for stencil-based applications at extreme scales
- Gamell, Marc; Teranishi, Keita; Heroux, Michael A.
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '15)
https://doi.org/10.1145/2807591.2807672
|
conference
|
January 2015 |
A bridging model for parallel computation
|
journal
|
August 1990 |
Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
|
conference
|
January 2014 |
Algorithm-based fault tolerance applied to high performance computing
|
journal
|
April 2009 |
An evaluation of User-Level Failure Mitigation support in MPI
|
journal
|
May 2013 |
Evaluating and extending user-level fault tolerance in MPI applications
|
journal
|
July 2016 |
Evaluating the viability of process replication reliability for exascale systems
- Ferreira, Kurt; Stearley, Jon; Laros, James H.
- Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11)
https://doi.org/10.1145/2063384.2063443
|
conference
|
January 2011 |
Berkeley lab checkpoint/restart (BLCR) for Linux clusters
|
journal
|
September 2006 |
The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
|
journal
|
November 2005 |
SLURM: Simple Linux Utility for Resource Management
|
book
|
January 2003 |
Low-latency, concurrent checkpointing for parallel programs
|
journal
|
January 1994 |
Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation
|
book
|
January 2004 |
PnMPI tools: a whole lot greater than the sum of their parts
|
conference
|
January 2007 |
Recovery in distributed systems using asynchronous message logging and checkpointing
|
conference
|
January 1988 |
Interconnect agnostic checkpoint/restart in Open MPI
- Hursey, Joshua; Mattox, Timothy I.; Lumsdaine, Andrew
- Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC '09)
https://doi.org/10.1145/1551609.1551619
|
conference
|
January 2009 |
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
|
journal
|
February 2013 |
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
- Li, Dong; Chen, Zizhong; Wu, Panruo
- Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13)
https://doi.org/10.1145/2503210.2503226
|
conference
|
January 2013 |
Distributed snapshots: determining global states of distributed systems
|
journal
|
February 1985 |
PMI Extensions for Scalable MPI Startup
|
conference
|
January 2014 |
A survey of rollback-recovery protocols in message-passing systems
|
journal
|
September 2002 |
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
|
book
|
January 2000 |
Algorithm-based fault tolerance on a hypercube multiprocessor
|
journal
|
January 1990 |
Toward Exascale Resilience
|
journal
|
September 2009 |
Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI
- Hassani, Amin; Skjellum, Anthony; Brightwell, Ron
- 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
https://doi.org/10.1109/DSN.2014.78
|
conference
|
June 2014 |
Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
- Gamell, Marc; Katz, Daniel S.; Kolla, Hemanth
- SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
https://doi.org/10.1109/SC.2014.78
|
conference
|
November 2014 |
FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery
- Sato, Kento; Moody, Adam; Mohror, Kathryn
- 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)
https://doi.org/10.1109/IPDPS.2014.126
|
conference
|
May 2014 |
HARNESS and fault tolerant MPI
|
journal
|
October 2001 |
Simplifying the Recovery Model of User-Level Failure Mitigation
|
conference
|
November 2014 |
Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application
- Karlin, Ian; Bhatele, Abhinav; Keasler, Jeff
- 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
https://doi.org/10.1109/IPDPS.2013.115
|
conference
|
May 2013 |
Failure Detection and Propagation in HPC systems
- Bosilca, George; Bouteiller, Aurelien; Guermouche, Amina
- SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
https://doi.org/10.1109/SC.2016.26
|
conference
|
November 2016 |
Non-Blocking PMI Extensions for Fast MPI Startup
|
conference
|
May 2015 |
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
- Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn
- 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
https://doi.org/10.1109/SC.2010.18
|
conference
|
November 2010 |
Designing and Evaluating MPI-2 Dynamic Process Management Support for InfiniBand
|
conference
|
September 2009 |
The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
|
conference
|
March 2007 |
Replication-Based Fault Tolerance for MPI Applications
|
journal
|
July 2009 |
Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters
- Chandrasekar, Raghunath Raja; Venkatesh, Akshay; Hamidouche, Khaled
- 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
https://doi.org/10.1109/CCGrid.2015.169
|
conference
|
May 2015 |
Coordinated checkpoint versus message log for fault tolerant MPI
|
journal
|
January 2004 |
Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI
|
conference
|
November 2006 |
Algorithm-Based Fault Tolerance for Matrix Operations
|
journal
|
June 1984 |
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
|
conference
|
January 2002 |
An Analysis Of Algorithm-Based Fault Tolerance Techniques
|
conference
|
April 1986 |
Toward Resilient Algorithms and Applications
|
preprint
|
January 2014 |