skip to main content

Title: On Undecidability Aspects of Resilient Computations and Implications to Exascale

Future Exascale computing systems with a large number of processors, memory elements and interconnection links, are expected to experience multiple, complex faults, which affect both applications and operating-runtime systems. A variety of algorithms, frameworks and tools are being proposed to realize and/or verify the resilience properties of computations that guarantee correct results on failure-prone computing systems. We analytically show that certain resilient computation problems in presence of general classes of faults are undecidable, that is, no algorithms exist for solving them. We first show that the membership verification in a generic set of resilient computations is undecidable. We describe classes of faults that can create infinite loops or non-halting computations, whose detection in general is undecidable. We then show certain resilient computation problems to be undecidable by using reductions from the loop detection and halting problems under two formulations, namely, an abstract programming language and Turing machines, respectively. These two reductions highlight different failure effects: the former represents program and data corruption, and the latter illustrates incorrect program execution. These results call for broad-based, well-characterized resilience approaches that complement purely computational solutions using methods such as hardware monitors, co-designs, and system- and application-specific diagnosis codes.
  1. ORNL
Publication Date:
OSTI Identifier:
DOE Contract Number:
Resource Type:
Resource Relation:
Conference: Euro-Par 2014: Parallel Processing Workshops: Resilience 2014, Porto, Portugal, 20140824, 20140828
Research Org:
Oak Ridge National Laboratory (ORNL)
Sponsoring Org:
SC USDOE - Office of Science (SC)
Country of Publication:
United States
Exascale systems; resilient computations; undecidability; uncomputability