Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales.
- Rutgers U
Abstract not provided.
- Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization: USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 1332855
- Report Number(s): SAND2015-10051C; 608212
- Resource Relation: Conference: Proposed for presentation at the International Conference for High Performance Computing, Networking, Storage and Analysis, held November 15-20, 2015 in Austin, Texas.
- Country of Publication: United States
- Language: English
FTI: high performance fault tolerance interface for hybrid systems | conference | January 2011
Post-failure recovery of MPI communication capability: Design and rationale | journal | June 2013
Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery | conference | August 2009
Coordinated checkpoint versus message log for fault tolerant MPI | conference | January 2003
Distributed snapshots: determining global states of distributed systems | journal | February 1985
Terascale direct numerical simulations of turbulent combustion using S3D | journal | January 2009
Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI | conference | November 2006
A survey of rollback-recovery protocols in message-passing systems | journal | September 2002
Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales | conference | November 2014
Exploring Failure Recovery for Stencil-based Applications at Extreme Scales | conference | January 2015
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications | conference | May 2011
Berkeley lab checkpoint/restart (BLCR) for Linux clusters | journal | September 2006
Toward resilient algorithms and applications | conference | January 2013
Interconnect agnostic checkpoint/restart in open MPI | conference | January 2009
The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI | conference | March 2007
On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance | conference | September 2012
MCREngine: A scalable checkpointing system using data-aware aggregation and compression | conference | November 2012
Optimizing Checkpoints Using NVM as Virtual Memory | conference | May 2013
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System | conference | November 2010
Enhancing Checkpoint Performance with Staging IO and SSD | conference | May 2010
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart | conference | September 2011
A 1 PB/s file system to checkpoint three million MPI tasks | conference | January 2013
FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery | conference | May 2014
DRAM errors in the wild: a large-scale field study | conference | January 2009
Toward Local Failure Local Recovery Resilience Model using MPI-ULFM | conference | January 2014
Tests and tolerances for high-performance software-implemented fault detection | journal | May 2003
A scalable double in-memory checkpoint and restart scheme towards exascale | conference | June 2012
Similar Records
Failure Masking and Local Recovery for Stencil-based Applications at Extreme Scales.
Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales.
Modeling and Simulating Multiple Failure Masking enabled by Local Recovery for Stencil-based Applications at Extreme Scales
Conference · 2015 · OSTI ID:1244932
Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales.
Conference · 2015 · OSTI ID:1291974
Modeling and Simulating Multiple Failure Masking enabled by Local Recovery for Stencil-based Applications at Extreme Scales
Journal Article · 2017 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1356841