OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Who watches the watchers?: preventing fault in a fault tolerance library

Abstract

The Scalable Checkpoint/Restart library (SCR) was developed and is used by researchers at Lawrence Livermore National Laboratory to provide a fast and efficient method of saving and recovering large applications during runtime on high-performance computing (HPC) systems. Though SCR protects other programs, until June 2017 nothing was actively protecting SCR. The goal of this project was to automate the building and testing of the library on the varying HPC architectures on which it is used. Our methods centered on a continuous integration tool called Bamboo, which allows automation agents to be installed on the HPC systems themselves. These agents gave us a new and customizable way to automate the allocation of resources and the running of tests with CMake’s unit testing framework, CTest, as well as integration testing scripts run through an HPC package manager called Spack. These methods provided a parallel environment in which to test the more complex features of SCR. As a result, SCR is now automatically built and tested on several HPC architectures any time developers change the library’s source code. The results of these tests are communicated back to the developers for immediate feedback, allowing them to fix any SCR functionality that may have broken. Hours of developers’ time are now saved from the tedious process of manual testing and debugging, which saves money and allows the SCR project team to focus their efforts on development. Thus, HPC system users can use SCR in conjunction with their own applications to checkpoint and restart efficiently and effectively as needed, with the assurance that SCR itself is functioning properly.
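The abstract describes the shape of the pipeline (Bamboo agents on the HPC systems, CTest for unit tests, Spack for integration builds) but not its scripts. The sketch below is a hypothetical, minimal agent-side driver illustrating how such a pipeline could be wired together. It is not taken from the SCR project: it assumes a SLURM-scheduled system and a CMake build of SCR, and the build directory, node count, and the use of salloc and spack install are illustrative assumptions rather than the project's actual configuration.

```python
#!/usr/bin/env python3
"""Hypothetical agent-side CI driver: build SCR, run CTest inside a compute
allocation, then exercise a Spack-based integration build.

Illustrative sketch only; not the SCR project's actual scripts. Assumes a
SLURM scheduler (salloc), CMake/CTest, and Spack are available on PATH.
"""
import subprocess
import sys

BUILD_DIR = "build"   # hypothetical out-of-source CMake build directory
NODES = 2             # hypothetical node count for MPI-based tests


def run(cmd, cwd=None):
    """Echo and run a command, raising on a non-zero exit code."""
    print("+", " ".join(cmd), flush=True)
    subprocess.run(cmd, cwd=cwd, check=True)


def main():
    # Configure and build SCR; the unit tests are registered with CTest.
    run(["cmake", "-S", ".", "-B", BUILD_DIR])
    run(["cmake", "--build", BUILD_DIR, "--parallel"])

    # Run the CTest suite inside a compute-node allocation so that
    # MPI-based tests have a parallel environment to run in.
    run(["salloc", f"--nodes={NODES}", "ctest", "--output-on-failure"],
        cwd=BUILD_DIR)

    # Integration path: build and install SCR through Spack.
    run(["spack", "install", "scr"])


if __name__ == "__main__":
    try:
        main()
    except subprocess.CalledProcessError as err:
        # A non-zero exit tells the CI server (e.g., Bamboo) that the plan
        # failed, which is what triggers the feedback to developers.
        sys.exit(err.returncode)
```

In practice, a CI plan of this kind would check out the branch under test and invoke a script like this on each target architecture, with the script's exit status determining the pass/fail result reported back to the developers.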

Authors:
Stanavige, C. D. [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
September 2017
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1393364
Report Number(s):
LLNL-TR-738530
DOE Contract Number:
AC52-07NA27344
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE

Citation Formats

Stanavige, C. D. Who watches the watchers?: preventing fault in a fault tolerance library. United States: N. p., 2017. Web. doi:10.2172/1393364.
Stanavige, C. D. Who watches the watchers?: preventing fault in a fault tolerance library. United States. doi:10.2172/1393364.
Stanavige, C. D. 2017. "Who watches the watchers?: preventing fault in a fault tolerance library". United States. doi:10.2172/1393364. https://www.osti.gov/servlets/purl/1393364.
@article{osti_1393364,
title = {Who watches the watchers?: preventing fault in a fault tolerance library},
author = {Stanavige, C. D.},
abstractNote = {The Scalable Checkpoint/Restart library (SCR) was developed and is used by researchers at Lawrence Livermore National Laboratory to provide a fast and efficient method of saving and recovering large applications during runtime on high-performance computing (HPC) systems. Though SCR protects other programs, up until June 2017, nothing was actively protecting SCR. The goal of this project was to automate the building and testing of this library on the varying HPC architectures on which it is used. Our methods centered around the use of a continuous integration tool called Bamboo that allowed for automation agents to be installed on the HPC systems themselves. These agents provided a way for us to establish a new and unique way to automate and customize the allocation of resources and running of tests with CMake’s unit testing framework, CTest, as well as integration testing scripts through an HPC package manager called Spack. These methods provided a parallel environment in which to test the more complex features of SCR. As a result, SCR is now automatically built and tested on several HPC architectures any time changes are made by developers to the library’s source code. The results of these tests are then communicated back to the developers for immediate feedback, allowing them to fix functionality of SCR that may have broken. Hours of developers’ time are now being saved from the tedious process of manually testing and debugging, which saves money and allows the SCR project team to focus their efforts towards development. Thus, HPC system users can use SCR in conjunction with their own applications to efficiently and effectively checkpoint and restart as needed with the assurance that SCR itself is functioning properly.},
doi = {10.2172/1393364},
place = {United States},
year = 2017,
month = 9
}

Similar Records:
  • This document provides a specification of Fenix, a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. The library consists of two modules. The first, termed process recovery, restores an application to a consistent state after it has suffered a loss of one or more MPI processes (ranks). The second specifies functions the user can invoke to store application data in Fenix-managed redundant storage, and to retrieve it from that storage after process recovery.
  • This paper describes a proposed automatically reconfigurable cellular architecture. The unique feature of this architecture is that the reconfiguration control is distributed within the system. There is no need for global broadcasting of switch settings. This reduces the interconnection complexity and the length of data paths. The system can reconfigure at the request of the applications software or in response to detected faults. This architecture supports fault-tolerant applications since the reconfiguration can be self-triggered from within. The complete reconfiguration process can proceed without external interference.
  • This work has concentrated on developing a unifying framework, under the name UNITY, for studying problem solving in parallel programming independent of specific architectural considerations. A simple model of computation and a logic were proposed to reason about properties of such programs, and problems from a variety of problem areas were studied. A number of transformations were developed that are appropriate for implementations on a variety of architectures: sequential, asynchronous shared-memory, distributed message passing, synchronous parallel with shared memory, systolic arrays, and VLSI chips. The diversity of the application areas and the architectures studied lends credence to the hypothesis that there is a UNITY to computer programming.
  • The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the complexity of these systems results in a significant probability of failure during lengthy computations. In the case of distributed memory multiprocessors, fault tolerance techniques developed for distributed operating systems and applications can be applied also to parallel computations. In the paper we survey some of the principal paradigms for fault-tolerant distributed computing and discuss their relevance to parallel processing. One particular technique--passive replication--is explored in detail as it forms the basis for fault tolerance in the Paralex parallel programming environment.
  • Sender-based message logging supports transparent fault tolerance in distributed systems in which all communication is through messages and all processes execute deterministically between received messages. It uses a pessimistic message logging protocol that requires no specialized hardware. Sender-based message logging differs from previous message logging methods in that it logs each message in the local volatile memory of the machine from which it was sent, thus greatly reducing the overhead of message logging.