skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Who watches the watchers?: preventing fault in a fault tolerance library

Abstract

The Scalable Checkpoint/Restart library (SCR) was developed and is used by researchers at Lawrence Livermore National Laboratory to provide a fast and efficient method of saving and recovering large applications during runtime on high-performance computing (HPC) systems. Though SCR protects other programs, up until June 2017, nothing was actively protecting SCR. The goal of this project was to automate the building and testing of this library on the varying HPC architectures on which it is used. Our methods centered around the use of a continuous integration tool called Bamboo that allowed for automation agents to be installed on the HPC systems themselves. These agents provided a way for us to establish a new and unique way to automate and customize the allocation of resources and running of tests with CMake’s unit testing framework, CTest, as well as integration testing scripts though an HPC package manager called Spack. These methods provided a parallel environment in which to test the more complex features of SCR. As a result, SCR is now automatically built and tested on several HPC architectures any time changes are made by developers to the library’s source code. The results of these tests are then communicated back to themore » developers for immediate feedback, allowing them to fix functionality of SCR that may have broken. Hours of developers’ time are now being saved from the tedious process of manually testing and debugging, which saves money and allows the SCR project team to focus their efforts towards development. Thus, HPC system users can use SCR in conjunction with their own applications to efficiently and effectively checkpoint and restart as needed with the assurance that SCR itself is functioning properly.« less

Authors:
 [1]
  1. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Publication Date:
Research Org.:
Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1393364
Report Number(s):
LLNL-TR-738530
DOE Contract Number:  
AC52-07NA27344
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE

Citation Formats

Stanavige, C. D. Who watches the watchers?: preventing fault in a fault tolerance library. United States: N. p., 2017. Web. doi:10.2172/1393364.
Stanavige, C. D. Who watches the watchers?: preventing fault in a fault tolerance library. United States. doi:10.2172/1393364.
Stanavige, C. D. Thu . "Who watches the watchers?: preventing fault in a fault tolerance library". United States. doi:10.2172/1393364. https://www.osti.gov/servlets/purl/1393364.
@article{osti_1393364,
title = {Who watches the watchers?: preventing fault in a fault tolerance library},
author = {Stanavige, C. D.},
abstractNote = {The Scalable Checkpoint/Restart library (SCR) was developed and is used by researchers at Lawrence Livermore National Laboratory to provide a fast and efficient method of saving and recovering large applications during runtime on high-performance computing (HPC) systems. Though SCR protects other programs, up until June 2017, nothing was actively protecting SCR. The goal of this project was to automate the building and testing of this library on the varying HPC architectures on which it is used. Our methods centered around the use of a continuous integration tool called Bamboo that allowed for automation agents to be installed on the HPC systems themselves. These agents provided a way for us to establish a new and unique way to automate and customize the allocation of resources and running of tests with CMake’s unit testing framework, CTest, as well as integration testing scripts though an HPC package manager called Spack. These methods provided a parallel environment in which to test the more complex features of SCR. As a result, SCR is now automatically built and tested on several HPC architectures any time changes are made by developers to the library’s source code. The results of these tests are then communicated back to the developers for immediate feedback, allowing them to fix functionality of SCR that may have broken. Hours of developers’ time are now being saved from the tedious process of manually testing and debugging, which saves money and allows the SCR project team to focus their efforts towards development. Thus, HPC system users can use SCR in conjunction with their own applications to efficiently and effectively checkpoint and restart as needed with the assurance that SCR itself is functioning properly.},
doi = {10.2172/1393364},
journal = {},
number = ,
volume = ,
place = {United States},
year = {Thu Sep 14 00:00:00 EDT 2017},
month = {Thu Sep 14 00:00:00 EDT 2017}
}

Technical Report:

Save / Share: