U.S. Department of Energy
Office of Scientific and Technical Information

Tolerating Correlated Failures for Generalized Cartesian Distributions via Bipartite Matching

Conference
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. A key ingredient of any approach to fault tolerance is effective support for fault-tolerant data storage. A typical application execution consists of phases in which certain data structures are modified while others are read-only. Often, read-only data structures constitute a large fraction of the total memory consumed. Fault tolerance for read-only data can be ensured through the use of checksums or parities, without resorting to expensive in-memory duplication or checkpointing to secondary storage. In this paper, we present a graph-matching approach to computing and storing parity data for read-only matrices that are compatible with fault-tolerant linear algebra (FTLA). Typical approaches support only blocked data distributions in which each process holds one block and the parity is located on additional processes. We assume that the matrices are blocked by a Cartesian grid, with each block assigned to a process, and we consider a generalized distribution in which each process can be assigned arbitrary blocks. We also account for the fact that multiple processes might be part of the same failure unit, such as an SMP node. The flexibility enabled by our novel application of graph matching extends fault tolerance support to data distributions beyond those supported by prior work. We evaluate the matching implementations and the cost of computing the parity and recovering lost data, demonstrating the low overhead incurred by our approach.
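The abstract does not spell out how the matching graph is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes XOR parity over equal-length blocks, treats each node as one failure unit, and uses a textbook augmenting-path bipartite matching to place each parity block on a node that holds none of the data blocks in its group. All names (`xor_parity`, `place_parities`, `block_node`, the node labels) are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together into a single parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def place_parities(groups, block_node, nodes):
    """Assign each parity group a distinct node holding none of its data blocks.

    Assumed formulation: groups on one side, nodes (failure units) on the
    other, with an edge wherever the node stores no block of the group.
    Returns {group_index: node}, or None if no complete placement exists.
    """
    allowed = [
        [n for n in nodes if all(block_node[b] != n for b in group)]
        for group in groups
    ]
    owner = {}  # node -> index of the group whose parity it currently stores

    def augment(g, seen):
        # Try to find a node for group g, re-routing earlier groups if needed.
        for n in allowed[g]:
            if n in seen:
                continue
            seen.add(n)
            if n not in owner or augment(owner[n], seen):
                owner[n] = g
                return True
        return False

    for g in range(len(groups)):
        if not augment(g, set()):
            return None
    return {g: n for n, g in owner.items()}


# Illustrative data: four read-only blocks spread over three nodes.
data = {"A": b"\x01\x02", "B": b"\x10\x20", "C": b"\x0f\x0f", "D": b"\xff\x00"}
block_node = {"A": "n0", "B": "n1", "C": "n1", "D": "n2"}
groups = [["A", "B"], ["C", "D"]]

placement = place_parities(groups, block_node, ["n0", "n1", "n2"])
parity0 = xor_parity([data[b] for b in groups[0]])

# Suppose node n1 fails: block B is lost, but A and parity0 survive elsewhere.
recovered_b = xor_parity([parity0, data["A"]])

# With only two nodes, group 0 has no node free of its own data blocks.
infeasible = place_parities([["A", "B"]], block_node, ["n0", "n1"])
```

In this toy setup, the failure of node `n1` removes one block from each group, and each group's parity sits on a surviving node, so both lost blocks can be rebuilt by XOR-ing the parity with the remaining block of the group.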
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1030865
Report Number(s):
PNNL-SA-76095; KJ0402000
Country of Publication:
United States
Language:
English

Similar Records

Multi-fault Tolerance for Cartesian Data Distributions
Journal Article · June 1, 2013 · International Journal of Parallel Programming, 41(3):469-493 · OSTI ID:1064566

Checksumming strategies for data in volatile memories
Conference · September 9, 2014 · OSTI ID:1236931

Error detection and correction utilizing locally stored parity information
Patent · April 2, 2019 · OSTI ID:1568154