Highly fault-tolerant parallel computation
- MIT, Cambridge, MA (United States)
We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log^{O(1)} w processors and time t log^{O(1)} w. The failure probability of the computation will be at most t · exp(-w^{1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log^{O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.
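The correctness condition of the coded model can be illustrated with a toy sketch. This is not the construction from the paper: it assumes a tiny Reed-Solomon-style code over the integers mod 7, a brute-force decoder, and a "device" that only adds two codewords coordinate by coordinate. The names `P`, `K`, `POINTS`, `encode`, `decode`, and `coded_add` are hypothetical and chosen for illustration. The sketch shows that, because the code is linear, the decoded output still equals the function of the decoded inputs even when a few output symbols are corrupted.

```python
from itertools import product

P = 7                        # prime field size (illustrative assumption)
K = 2                        # message length: polynomials of degree < 2
POINTS = list(range(1, 7))   # 6 evaluation points, so minimum distance is 5

def encode(msg):
    """Evaluate the polynomial with coefficients msg at every point, mod P."""
    return [sum(c * pow(x, i, P) for i, c in enumerate(msg)) % P
            for x in POINTS]

def decode(word):
    """Brute-force nearest-codeword decoding over all P**K messages."""
    return min(product(range(P), repeat=K),
               key=lambda m: sum(a != b for a, b in zip(encode(m), word)))

def coded_add(cw_a, cw_b):
    """Toy device that works on codewords: coordinate-wise addition mod P."""
    return [(a + b) % P for a, b in zip(cw_a, cw_b)]

a, b = (3, 5), (6, 4)
out = coded_add(encode(a), encode(b))

# Simulate two faulty components by corrupting two output symbols.
out[0] = (out[0] + 1) % P
out[3] = (out[3] + 3) % P

expected = tuple((x + y) % P for x, y in zip(a, b))
print(decode(out) == expected)
```

With minimum distance 5, this toy code corrects up to two symbol errors, so the two injected faults do not change the decoded result. The brute-force decoder is only practical at this size; the abstract states that the generalized Reed-Solomon codes it uses admit near-linear-time encoding and decoding.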
- OSTI ID: 457647
- Report Number(s): CONF-961004-; TRN: 97:001036-0018
- Resource Relation: Conference: 37. annual symposium on foundations of computer science, Burlington, VT (United States), 13-16 Oct 1996; Other Information: PBD: 1996; Related Information: Is Part Of Proceedings of the 37th annual symposium on foundations of computer science; PB: 656 p.
- Country of Publication: United States
- Language: English