Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Benoit, Anne; Cavelan, Aurélien; Cappello, Franck; Raghavan, Padma; Robert, Yves; Sun, Hongyang

doi:10.1016/j.jpdc.2018.08.002

Journal Article · Sat Dec 01 04:00:00 EST 2018 · Journal of Parallel and Distributed Computing

DOI: https://doi.org/10.1016/j.jpdc.2018.08.002 · OSTI ID: 1475194

This paper provides a model and an analytical study of replication as a technique to cope with silent errors, as well as with a mixture of silent and fail-stop errors, on large-scale platforms. Unlike fail-stop errors, which are detected immediately when they occur, silent errors require a detection mechanism. Many application-specific techniques are available to detect silent errors, based on algorithms (e.g., ABFT), invariant preservation, or data analytics, but replication remains the most transparent and least intrusive technique. We explore the right level of replication (duplication, triplication, or more) for two frameworks: (i) when the platform is subject to only silent errors, and (ii) when the platform is subject to both silent and fail-stop errors. A higher level of replication is more expensive in terms of resource usage but makes it possible to tolerate more errors and even to correct some of them, hence there is a trade-off to be found. Replication is combined with checkpointing and comes in two flavors: process replication and group replication. Process replication applies to message-passing applications with communicating processes: each process is replicated, and the platform is composed of process pairs or triplets. Group replication applies to black-box applications, whose parallel execution is replicated several times: the platform is partitioned into two halves (or three thirds). In both scenarios, results are compared before each checkpoint, which is taken only when both results (duplication) or two out of three results (triplication) coincide. Otherwise, one or more silent errors have been detected, and the application rolls back to the last checkpoint, as it also does when a fail-stop error strikes. We provide a detailed analytical study of all of these scenarios, with formulas to decide, for each scenario, the optimal parameters as a function of the error rate, checkpoint cost, and platform size. We also report a set of extensive simulation results that corroborate the analytical model.
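
The compare-before-checkpoint rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `verify_replicas` and the "checkpoint"/"rollback" labels are hypothetical, and extending the rule to more than three replicas by strict majority is an assumption that the abstract does not spell out.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def verify_replicas(results: Sequence[Hashable]) -> Tuple[str, Optional[Hashable]]:
    """Decide whether to checkpoint or roll back from replica results.

    Hypothetical helper: a strict majority of replicas must agree.
    With 2 replicas this means both results coincide (duplication);
    with 3 replicas it means two out of three coincide (triplication).
    """
    if len(results) < 2:
        raise ValueError("need at least two replicas to detect a silent error")

    # Most frequent result and the number of replicas that produced it.
    value, votes = Counter(results).most_common(1)[0]

    if 2 * votes > len(results):
        # Enough replicas agree: the agreed value is safe to checkpoint.
        return ("checkpoint", value)

    # No strict majority: a silent error was detected, so re-execute
    # from the last checkpoint.
    return ("rollback", None)
```

Under this sketch, duplication can only detect a mismatch, whereas triplication can also mask a single corrupted result, which is the resource-versus-resilience trade-off the abstract refers to.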

Research Organization: Argonne National Laboratory (ANL)

Sponsoring Organization: National Science Foundation (NSF)

DOE Contract Number: AC02-06CH11357

OSTI ID: 1475194

Journal Information: Journal of Parallel and Distributed Computing, Vol. 122, Issue C; ISSN 0743-7315

Publisher: Elsevier

Country of Publication: United States

Language: English

Similar Records

A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

Journal Article · Wed Jul 17 20:00:00 EDT 2024 · Future Generations Computer Systems · OSTI ID:2406527

Optimistic execution and checkpoint comparison for error recovery in parallel and distributed systems

Technical Report · Fri May 08 00:00:00 EDT 1992 · OSTI ID:7026260

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods

Conference · Tue May 31 00:00:00 EDT 2016 · OSTI ID:1322529

Related Subjects

Checkpointing
Fail-stop errors
Fault tolerance
High-performance computing
Replication
Silent errors
