Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales.

Gamell, Marc; Teranishi, Keita; Heroux, Michael A.; Mayo, Jackson; Kolla, Hemanth; Chen, Jacqueline H.; Parashar, Manish

doi:10.1145/2807591.2807672

Conference · November 1, 2015

DOI: https://doi.org/10.1145/2807591.2807672 · OSTI ID: 1332855

Gamell, Marc [1]; Teranishi, Keita; Heroux, Michael A.; Mayo, Jackson; Kolla, Hemanth; Chen, Jacqueline H.; Parashar, Manish [1]

[1] Rutgers U

Abstract not provided.


Research Organization: Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Sponsoring Organization: USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)

DOE Contract Number: AC04-94AL85000

OSTI ID: 1332855

Report Number(s): SAND2015-10051C; 608212

Resource Relation: Conference: Proposed for presentation at The International Conference for High Performance Computing, Networking, Storage and Analysis, held November 15-20, 2015 in Austin, Texas.

Country of Publication: United States

Language: English

Similar Records

Failure Masking and Local Recovery for Stencil-based Applications at Extreme Scales.

Conference · January 1, 2015 · OSTI ID: 1332855

Gamell, Marc; Teranishi, Keita; Heroux, Michael Allen; +4 more

Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales.

Conference · August 1, 2015 · OSTI ID: 1332855

Gamell Balmana, Marc; Teranishi, Keita; Heroux, Michael Allen; +4 more

Modeling and Simulating Multiple Failure Masking enabled by Local Recovery for Stencil-based Applications at Extreme Scales

Journal Article · April 24, 2017 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID: 1332855

Gamell, Marc; Teranishi, Keita; Mayo, Jackson; +4 more