U.S. Department of Energy
Office of Scientific and Technical Information

Failure Recovery in Resilient X10

Journal Article · ACM Transactions on Programming Languages and Systems
DOI: https://doi.org/10.1145/3332372 · OSTI ID: 1611172
 [1];  [2];  [1];  [1];  [3];  [4];  [5];  [1];  [3];  [1]
  1. IBM T. J. Watson Research Center, Yorktown Heights, NY
  2. Australian National University, Sorbonne Université, and INRIA Paris, France
  3. IBM Research-Tokyo, Chuo-ku, Tokyo, Japan
  4. Australian National University, Canberra, Australia
  5. Goldman Sachs, New York, NY

Not provided.

Research Organization:
Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States)
Sponsoring Organization:
USDOE Office of Science (SC)
DOE Contract Number:
SC0008923
OSTI ID:
1611172
Journal Information:
ACM Transactions on Programming Languages and Systems, Vol. 41, Issue 3; ISSN 0164-0925
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English

References (37)

Parallel Programming with Migratable Objects: Charm++ in Practice
  • Acun, Bilge; Gupta, Abhishek; Jain, Nikhil
  • SC14: International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1109/SC.2014.58
conference November 2014
MillWheel: fault-tolerant stream processing at internet scale journal August 2013
Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver conference May 2014
Spark SQL: Relational Data Processing in Spark conference January 2015
Algorithm-based fault tolerance applied to high performance computing journal April 2009
HaLoop: efficient iterative data processing on large clusters journal September 2010
Orleans: cloud computing for everyone conference January 2011
Habanero-Java: the new adventures of old X10 conference January 2011
X10: an object-oriented approach to non-uniform cluster computing
  • Charles, Philippe; Grothoff, Christian; Saraswat, Vijay
  • Proceedings of the 20th annual ACM SIGPLAN conference on Object oriented programming systems languages and applications - OOPSLA '05 https://doi.org/10.1145/1094811.1094852
conference January 2005
Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience journal January 2015
EventWave: programming model and runtime support for tightly-coupled elastic cloud applications conference January 2013
Resilient X10: efficient failure-aware programming
  • Cunningham, David; Grove, David; Herta, Benjamin
  • Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14 https://doi.org/10.1145/2555243.2555248
conference January 2014
A survey of rollback-recovery protocols in message-passing systems journal September 2002
A Robust Fault Tolerance Scheme for Lifeline-Based Taskpools conference August 2016
Towards an efficient fault-tolerance scheme for GLB conference January 2015
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications
  • Guermouche, Amina; Ropars, Thomas; Brunet, Elisabeth
  • Distributed Processing Symposium (IPDPS), 2011 IEEE International Parallel & Distributed Processing Symposium https://doi.org/10.1109/IPDPS.2011.95
conference May 2011
Resilient X10 over MPI user level failure mitigation conference January 2016
A Resilient Framework for Iterative Linear Algebra Applications in X10 conference May 2015
LULESH 2.0 Updates and Changes report July 2013
HabaneroUPC++: a Compiler-free PGAS Library
  • Kumar, Vivek; Zheng, Yili; Cavé, Vincent
  • Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14 https://doi.org/10.1145/2676870.2676879
conference January 2014
Least squares quantization in PCM journal March 1982
Distributed GraphLab: a framework for machine learning and data mining in the cloud journal April 2012
Pregel: a system for large-scale graph processing conference January 2010
Transparently Resilient Task Parallelism for Chapel conference May 2016
A decade of progress in parallel programming productivity journal October 2014
Probabilistic accuracy bounds for fault-tolerant computations that discard tasks conference January 2006
Lifeline-based global load balancing
  • Saraswat, Vijay A.; Kambadur, Prabhanjan; Kodali, Sreedhar
  • Proceedings of the 16th ACM symposium on Principles and practice of parallel programming - PPoPP '11 https://doi.org/10.1145/1941553.1941582
conference January 2011
Fail-stop processors: an approach to designing fault-tolerant computing systems journal August 1983
M3R: increased performance for in-memory Hadoop jobs journal August 2012
X10 and APGAS at Petascale
  • Tardieu, Olivier; Herta, Benjamin; Cunningham, David
  • Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '14 https://doi.org/10.1145/2555243.2555245
conference January 2014
Apache Hadoop YARN: yet another resource negotiator conference January 2013
Reliability with Erlang journal November 2007
Managing Asynchronous Operations in Coarray Fortran 2.0
  • Yang, Chaoran; Murthy, Karthik; Mellor-Crummey, John
  • 2013 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on Parallel and Distributed Processing https://doi.org/10.1109/IPDPS.2013.17
conference May 2013
A first order approximation to the optimum checkpoint interval journal September 1974
GLB: lifeline-based global load balancing library in x10 conference January 2014
A scalable double in-memory checkpoint and restart scheme towards exascale
  • Zheng, Gengbin; Kale, Laxmikant V.
  • 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) https://doi.org/10.1109/DSNW.2012.6264677
conference June 2012
UPC++: A PGAS Extension for C++
  • Zheng, Yili; Kamil, Amir; Driscoll, Michael B.
  • 2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2014 IEEE 28th International Parallel and Distributed Processing Symposium https://doi.org/10.1109/IPDPS.2014.115
conference May 2014
