U.S. Department of Energy
Office of Scientific and Technical Information

Lightweight storage and overlay networks for fault tolerance.

Technical Report · DOI: https://doi.org/10.2172/989384 · OSTI ID: 989384

The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands to millions of processors. In such environments, it is critical to have fault-tolerance mechanisms, including checkpoint/restart, that scale with the size of applications and the percentage of the system on which the applications execute. For application-driven, periodic checkpoint operations, the state of the art does not provide a scalable solution. For example, on today's massive-scale systems running applications that consume most of the memory of their compute nodes, checkpoint operations account for nearly 80% of the total I/O usage. Motivated by this observation, this project aims to improve I/O performance for application-directed checkpoints through the use of lightweight storage architectures and overlay networks. Lightweight storage provides direct access to underlying storage devices. Overlay networks provide caching and processing capabilities in the compute-node fabric. The combination has the potential to significantly reduce I/O overhead for large-scale applications. This report describes our combined efforts to model and understand overheads for application-directed checkpoints, as well as the implementation and performance analysis of a checkpoint service that uses available compute nodes as a network cache for checkpoint operations.

Research Organization:
Sandia National Laboratories
Sponsoring Organization:
USDOE
DOE Contract Number:
AC04-94AL85000
OSTI ID:
989384
Report Number(s):
SAND2010-0040
Country of Publication:
United States
Language:
English
