OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales.

Conference

Abstract not provided.

Research Organization:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1332855
Report Number(s):
SAND2015-10051C; 608212
Resource Relation:
Conference: Proposed for presentation at the International Conference for High Performance Computing, Networking, Storage and Analysis, held November 15-20, 2015, in Austin, Texas.
Country of Publication:
United States
Language:
English

