
Title: Self-stabilizing Connected Components

Abstract

For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, such as bit flips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, starting from any state, valid or invalid, it is guaranteed to reach a valid state after a finite number of steps. Therefore, on a machine subject to transient faults, a self-stabilizing algorithm can recover if a fault causes the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct an invalid state. The self-stabilizing label-propagation algorithm performs O(V log V) additional computation and requires O(V) additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional label propagation). When run against a battery of simulated fault-injection tests, the self-stabilizing label-propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR, as it requires fewer iterations in total. Beyond fault tolerance, we believe the properties of self-stabilizing label propagation are useful from a theoretical perspective and may have other use cases.
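As a rough illustration of the setting the abstract describes, the sketch below implements conventional min-label propagation and injects a corrupted label by hand. It is not the paper's self-stabilizing algorithm; the function name `label_propagation`, the toy graph, and the injected fault values are all assumptions made for this example.

```python
def label_propagation(num_vertices, edges, labels=None):
    """Conventional min-label propagation for connected components.

    Every vertex repeatedly adopts the smallest label among itself and
    its neighbors until no label changes. `labels` optionally supplies a
    (possibly corrupted) starting state; by default vertex v starts with
    label v.
    """
    adjacency = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    labels = list(range(num_vertices)) if labels is None else list(labels)
    changed = True
    while changed:
        changed = False
        for v in range(num_vertices):
            smallest = min([labels[v]] + [labels[u] for u in adjacency[v]])
            if smallest < labels[v]:
                labels[v] = smallest
                changed = True
    return labels


# Toy graph (hypothetical): components {0, 1, 2} and {3, 4}.
edges = [(0, 1), (1, 2), (3, 4)]

fault_free = label_propagation(5, edges)

# Simulated fault that makes vertex 4's label too large (4 -> 6).
# Min-propagation overwrites it with a neighbor's smaller label.
corrupted_up = label_propagation(5, edges, labels=[0, 1, 2, 3, 6])

# Simulated fault that makes vertex 4's label too small (4 -> 0).
# The bogus label spreads and merges two unrelated components.
corrupted_down = label_propagation(5, edges, labels=[0, 1, 2, 3, 0])

print(fault_free, corrupted_up, corrupted_down)
```

In this toy run the upward corruption is absorbed, while the downward corruption leaves the conventional algorithm converged on a wrong answer with a single component. That second case is the kind of invalid state a self-stabilizing variant would need to detect and correct.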

Authors:
Sao, Piyush [1]; Engelmann, Christian [1]; Eswar, Srinivas [2]; Green, Oded [2]; Vuduc, Richard [3]
  1. ORNL
  2. Georgia Institute of Technology
  3. Georgia Institute of Technology, Atlanta
Publication Date:
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1649445
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: The 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019, Denver, Colorado, United States of America, 11/22/2019
Country of Publication:
United States
Language:
English

Citation Formats

Sao, Piyush, Engelmann, Christian, Eswar, Srinivas, Green, Oded, and Vuduc, Richard. Self-stabilizing Connected Components. United States: N. p., 2019. Web. doi:10.1109/FTXS49593.2019.00011.
Sao, Piyush, Engelmann, Christian, Eswar, Srinivas, Green, Oded, & Vuduc, Richard. Self-stabilizing Connected Components. United States. https://doi.org/10.1109/FTXS49593.2019.00011
Sao, Piyush, Engelmann, Christian, Eswar, Srinivas, Green, Oded, and Vuduc, Richard. 2019. "Self-stabilizing Connected Components". United States. https://doi.org/10.1109/FTXS49593.2019.00011. https://www.osti.gov/servlets/purl/1649445.
@article{osti_1649445,
title = {Self-stabilizing Connected Components},
author = {Sao, Piyush and Engelmann, Christian and Eswar, Srinivas and Green, Oded and Vuduc, Richard},
abstractNote = {For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, such as bit flips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, starting from any state, valid or invalid, it is guaranteed to reach a valid state after a finite number of steps. Therefore, on a machine subject to transient faults, a self-stabilizing algorithm can recover if a fault causes the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct an invalid state. The self-stabilizing label-propagation algorithm performs O(V log V) additional computation and requires O(V) additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional label propagation). When run against a battery of simulated fault-injection tests, the self-stabilizing label-propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR, as it requires fewer iterations in total. Beyond fault tolerance, we believe the properties of self-stabilizing label propagation are useful from a theoretical perspective and may have other use cases.},
doi = {10.1109/FTXS49593.2019.00011},
url = {https://www.osti.gov/biblio/1649445},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2019},
month = {11}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
