OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Staghorn: An Automated Large-Scale Distributed System Analysis Platform

Abstract

Conducting experiments on large-scale distributed computing systems is becoming significantly easier with the assistance of emulation. Researchers can now create a model of a distributed computing environment and then generate a virtual, laboratory copy of the entire system composed of potentially thousands of virtual machines, switches, and software. The use of real software, running at clock rate in full virtual machines, allows experiments to produce meaningful results without necessitating a full understanding of all model components. However, the ability to inspect and modify elements within these models is bound by the limitation that such modifications must compete with the model, either running in or alongside it. This inhibits entire classes of analyses from being conducted upon these models. We developed a mechanism to snapshot an entire emulation-based model as it is running. This allows us to "freeze time" and subsequently fork execution, replay execution, modify arbitrary parts of the model, or deeply explore the model. This snapshot includes capturing packets in transit and other input/output state along with the running virtual machines. We were able to build this system in Linux using Open vSwitch and Kernel Virtual Machines on top of Sandia's emulation platform Firewheel. This primitive opens the door to numerous subsequent analyses on models, including state space exploration, debugging distributed systems, performance optimizations, improved training environments, and improved experiment repeatability.
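The snapshot, fork, and replay idea in the abstract can be illustrated with a toy sketch. This is not Staghorn's implementation: the report works with real virtual machines and Open vSwitch, while the sketch below assumes a hypothetical in-memory `ToyModel` and uses Python's `copy.deepcopy` as a stand-in for whole-system snapshotting. All class and function names are invented for illustration. The point it tries to show is that node state and packets in transit are captured together, so a restored copy can be replayed or forked without disturbing the frozen state.

```python
import copy
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for an emulated system: node state plus packets in transit."""
    nodes: dict = field(default_factory=dict)          # node name -> integer state
    in_transit: deque = field(default_factory=deque)   # (destination, payload) pairs

    def send(self, dst, payload):
        """Queue a packet; it is 'in transit' until step() delivers it."""
        self.in_transit.append((dst, payload))

    def step(self):
        """Deliver one in-flight packet; return False when the network is idle."""
        if not self.in_transit:
            return False
        dst, payload = self.in_transit.popleft()
        self.nodes[dst] = self.nodes.get(dst, 0) + payload
        return True

    def run(self):
        """Deliver every remaining packet and return the final node state."""
        while self.step():
            pass
        return dict(self.nodes)


def snapshot(model):
    """Capture node state and in-transit packets together as one consistent copy."""
    return copy.deepcopy(model)


def restore(snap):
    """Return a fresh, independent model; the snapshot itself stays untouched."""
    return copy.deepcopy(snap)


model = ToyModel(nodes={"a": 0, "b": 0})
model.send("a", 1)
model.send("b", 2)
model.send("a", 3)
model.step()                      # "a" has received 1; two packets are still in flight

frozen = snapshot(model)          # "freeze time" mid-run

replay_1 = restore(frozen).run()  # replay execution from the snapshot
replay_2 = restore(frozen).run()  # a second replay reaches the same end state

fork = restore(frozen)            # fork execution, then modify the model
fork.send("b", 10)
forked = fork.run()
```

Because the toy model is deterministic and the in-flight queue is part of the snapshot, both replays end in the same state, while the fork diverges only by the packet injected after restoring.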

Authors:
 Gabert, Kasimir [1]; Burns, Ian [1]; Elliott, Steven [1]; Kallaher, Jenna [1]; Vail, Adam [1]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
September 2016
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1411885
Report Number(s):
SAND2016-9616
657048
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Gabert, Kasimir, Burns, Ian, Elliott, Steven, Kallaher, Jenna, and Vail, Adam. Staghorn: An Automated Large-Scale Distributed System Analysis Platform. United States: N. p., 2016. Web. doi:10.2172/1411885.
Gabert, Kasimir, Burns, Ian, Elliott, Steven, Kallaher, Jenna, & Vail, Adam. Staghorn: An Automated Large-Scale Distributed System Analysis Platform. United States. https://doi.org/10.2172/1411885
Gabert, Kasimir, Burns, Ian, Elliott, Steven, Kallaher, Jenna, and Vail, Adam. 2016. "Staghorn: An Automated Large-Scale Distributed System Analysis Platform". United States. https://doi.org/10.2172/1411885. https://www.osti.gov/servlets/purl/1411885.
@article{osti_1411885,
title = {Staghorn: An Automated Large-Scale Distributed System Analysis Platform},
author = {Gabert, Kasimir and Burns, Ian and Elliott, Steven and Kallaher, Jenna and Vail, Adam},
abstractNote = {Conducting experiments on large-scale distributed computing systems is becoming significantly easier with the assistance of emulation. Researchers can now create a model of a distributed computing environment and then generate a virtual, laboratory copy of the entire system composed of potentially thousands of virtual machines, switches, and software. The use of real software, running at clock rate in full virtual machines, allows experiments to produce meaningful results without necessitating a full understanding of all model components. However, the ability to inspect and modify elements within these models is bound by the limitation that such modifications must compete with the model, either running in or alongside it. This inhibits entire classes of analyses from being conducted upon these models. We developed a mechanism to snapshot an entire emulation-based model as it is running. This allows us to "freeze time" and subsequently fork execution, replay execution, modify arbitrary parts of the model, or deeply explore the model. This snapshot includes capturing packets in transit and other input/output state along with the running virtual machines. We were able to build this system in Linux using Open vSwitch and Kernel Virtual Machines on top of Sandia's emulation platform Firewheel. This primitive opens the door to numerous subsequent analyses on models, including state space exploration, debugging distributed systems, performance optimizations, improved training environments, and improved experiment repeatability.},
doi = {10.2172/1411885},
url = {https://www.osti.gov/biblio/1411885},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2016},
month = {9}
}