Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

Lifflander, Jonathan; Meneses, Esteban; Menon, Harshita; Miller, Phil; Krishnamoorthy, Sriram; Kale, Laxmikant

doi:10.1109/CLUSTER.2014.6968739

Conference · September 22, 2014

DOI: https://doi.org/10.1109/CLUSTER.2014.6968739 · OSTI ID: 1178512

Deterministic replay of a parallel application is commonly used for discovering bugs or to recover from a hard fault with message-logging fault tolerance. For message-passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application in order to maintain generality. However, in many cases, only a partial order must be recorded due to determinism intrinsic in the code, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In this paper, we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different concurrent orderings and interleavings. By exploiting this theory, we improve on an existing scalable message-logging fault tolerance scheme. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2x lower overhead than one that records a total order.
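
The contrast between logging a total order and logging only a partial order can be illustrated with a small sketch. This is not the algebraic framework from the paper; it is a toy model under stated assumptions: each received message is a hypothetical `(target_object, kind)` pair, two messages commute when they target different objects or are both commutative `"add"` updates, and the functions `total_order_deps`, `partial_order_deps`, and `commutes` are illustrative names invented here. The sketch keeps only the ordering edges between non-commuting events and drops any edge already implied by a chain of other edges.

```python
from itertools import combinations

def total_order_deps(events):
    """Log every consecutive pair: replay must repeat the exact sequence."""
    return {(i, i + 1) for i in range(len(events) - 1)}

def partial_order_deps(events, commutes):
    """Log only the order between events that do not commute (toy model)."""
    n = len(events)
    # Each non-commuting pair must keep its observed relative order.
    required = {(i, j) for i, j in combinations(range(n), 2)
                if not commutes(events[i], events[j])}
    succ = {i: {j for (a, j) in required if a == i} for i in range(n)}
    # reach[i]: every event that must come after i, directly or transitively.
    reach = {}
    for i in reversed(range(n)):
        reach[i] = set(succ[i])
        for j in succ[i]:
            reach[i] |= reach[j]
    # Drop an edge (i, j) when another successor k of i already forces j later.
    return {(i, j) for (i, j) in required
            if not any(j in reach[k] for k in succ[i] if k != j)}

def commutes(a, b):
    """Assumed rule: different targets commute, and so do two 'add' updates."""
    return a[0] != b[0] or (a[1] == "add" and b[1] == "add")

# Hypothetical receive sequence observed on one process.
events = [("x", "set"), ("y", "set"), ("x", "set"),
          ("z", "add"), ("x", "set"), ("z", "add")]

print(len(total_order_deps(events)))                 # 5 edges
print(sorted(partial_order_deps(events, commutes)))  # [(0, 2), (2, 4)]
```

In this toy trace, only the three writes to `x` constrain one another, so two logged dependencies suffice where a total order would record five. Any replay that respects those two edges reproduces the same final state under the assumed commutativity rule.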

Research Organization: Pacific Northwest National Lab. (PNNL), Richland, WA (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-76RL01830

OSTI ID: 1178512

Report Number(s): PNNL-SA-103978; KJ0402000

Resource Relation: Conference: IEEE International Conference on Cluster Computing (CLUSTER 2014), September 22-26, 2014, Madrid, Spain, 19-28

Country of Publication: United States

Language: English

Similar Records

Adaptive message logging for incremental replay of message-passing programs

Conference · December 31, 1993 · OSTI ID: 1178512

Netzer, R H.B.; Xu, J

Distributed system fault tolerance using sender-based message logging

Technical Report · January 1, 1990 · OSTI ID: 1178512

Johnson, D B; Zwaenepoel, W

Hardware-assisted replay of microprocessor programs

Book · January 1, 1991 · OSTI ID: 1178512

Bacon, D F; Goldstein, S C

Related Subjects

replay
partial-order dependencies
fault tolerance
message logging
