Checksumming strategies for data in volatile memories
The increase in the number of processors needed to build exascale systems implies that the mean time between failures will further decrease, making it increasingly important to develop scalable techniques for fault tolerance. In this paper we develop an efficient checksum-based approach to fault tolerance for data in volatile memory systems, i.e., an approach that does not require saving any data to stable persistent storage. The developed scheme is applicable in multiple scenarios, including: 1) online recovery of large read-only data structures from the memory of failed nodes, with very low storage overhead; 2) online recovery from soft errors in blocked data; and 3) online recovery of read/write data via in-memory checkpointing. The approach uses a logical multi-dimensional view of the data to be protected. Changing the dimensionality of the data view enables a trade-off among multiple factors, including the storage overhead, the checksum generation time, the failure recovery time, and the number of faults that can be tolerated. Experimental results demonstrating effectiveness are presented on a Cray XE6 system.
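A minimal sketch of the kind of scheme the abstract describes (this is an illustrative assumption, not the paper's actual implementation): data blocks are arranged in a logical 2D grid, with one XOR checksum block kept per row and per column. A single lost block can then be rebuilt from the surviving blocks in its row (or column), and raising the dimensionality of the view changes the storage overhead and the number of simultaneous failures that can be tolerated.

```python
# Hypothetical illustration of checksum protection over a 2D logical view
# of in-memory data blocks. All names here are invented for the sketch.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def make_checksums(grid):
    """Return (row_sums, col_sums): one XOR checksum block per row/column."""
    rows = [xor_blocks(row) for row in grid]
    cols = [xor_blocks([grid[r][c] for r in range(len(grid))])
            for c in range(len(grid[0]))]
    return rows, cols

def recover(grid, r, c, row_sums):
    """Rebuild the lost block at (r, c) from the other blocks in row r."""
    survivors = [grid[r][j] for j in range(len(grid[r])) if j != c]
    return xor_blocks(survivors + [row_sums[r]])

# Example: a 2x2 grid of 4-byte blocks.
grid = [[b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"],
        [b"\xaa\xbb\xcc\xdd", b"\x0f\x0e\x0d\x0c"]]
row_sums, col_sums = make_checksums(grid)
rebuilt = recover(grid, 0, 1, row_sums)   # simulate losing block (0, 1)
assert rebuilt == grid[0][1]
```

For an n-block data set, a 1D view needs only one checksum block but tolerates a single failure; the 2D view above spends roughly 2*sqrt(n) checksum blocks and can recover more failure patterns, which is the overhead/resilience trade-off the abstract refers to.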
- Resource Relation:
- Conference: 43rd International Conference on Parallel Processing Workshops (ICPPW 2014), September 9-12, 2014, Minneapolis, Minnesota, 245-254
- IEEE, Piscataway, NJ, United States (US)
- Research Org:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (US)
- Country of Publication:
- United States