Title: Highly fault-tolerant parallel computation

Conference
OSTI ID: 457647
 [1]
  1. MIT, Cambridge, MA (United States)

We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time $t$ on $w$ processors can be performed reliably on a faulty machine in the coded model using $w \log^{O(1)} w$ processors and time $t \log^{O(1)} w$. The failure probability of the computation will be at most $t \cdot \exp(-w^{1/4})$. The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in $O(n \log^{O(1)} n)$ sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.
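
The idea of self-correcting a linear function through a code can be illustrated with a toy sketch. It is not the construction from the paper: the field size `P`, the function `faulty_f`, the brute-force decoder, and all parameter values are assumptions chosen only to keep the example small. The inputs are encoded with a Reed-Solomon-style code, a faulty evaluator is applied to every codeword symbol, and because the function is linear the outputs form a corrupted codeword of the correct results, which decoding then recovers.

```python
from itertools import combinations

P = 257  # small prime field GF(P); a hypothetical choice for illustration


def interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(message, n):
    """Systematic Reed-Solomon-style encoding: the k message symbols are the
    polynomial's values at 0..k-1; the codeword is its values at 0..n-1."""
    points = list(enumerate(message))
    return [interpolate(points, x) for x in range(n)]


def decode(word, k, max_errors):
    """Brute-force unique decoding (only practical for tiny n): find the
    codeword that differs from `word` in at most max_errors positions.
    Uniqueness requires 2 * max_errors <= n - k."""
    n = len(word)
    for subset in combinations(range(n), k):
        points = [(x, word[x]) for x in subset]
        candidate = [interpolate(points, x) for x in range(n)]
        if sum(c != w for c, w in zip(candidate, word)) <= max_errors:
            return candidate[:k]  # systematic: first k symbols are the message
    raise ValueError("too many errors to decode")


def faulty_f(x, fail):
    """Hypothetical linear function f(x) = 5x over GF(P), evaluated by a
    component that returns a wrong value when `fail` is true."""
    correct = 5 * x % P
    return (correct + 1) % P if fail else correct


inputs = [10, 20, 30]          # k = 3 inputs to evaluate in parallel
n = 7                          # codeword length; tolerates (7 - 3) // 2 = 2 errors
coded_inputs = encode(inputs, n)
faulty_positions = {1, 4}      # assumed locations of the two faulty evaluations
coded_outputs = [faulty_f(x, i in faulty_positions)
                 for i, x in enumerate(coded_inputs)]
recovered = decode(coded_outputs, k=3, max_errors=2)
print(recovered)               # equals [5 * x % P for x in inputs]
```

In this sketch the overhead is a factor of n/k evaluations of the function, and the decoder is exhaustive. The generalized Reed-Solomon codes named in the abstract admit far faster encoding and decoding than this brute-force search.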

Report Number(s):
CONF-961004-; TRN: 97:001036-0018
Resource Relation:
Conference: 37th Annual Symposium on Foundations of Computer Science, Burlington, VT (United States), 13-16 Oct 1996; Other Information: PBD: 1996; Related Information: Is Part Of Proceedings of the 37th Annual Symposium on Foundations of Computer Science; PB: 656 p.
Country of Publication:
United States
Language:
English