OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Staghorn: An Automated Large-Scale Distributed System Analysis Platform

Abstract

Conducting experiments on large-scale distributed computing systems is becoming significantly easier with the assistance of emulation. Researchers can now create a model of a distributed computing environment and then generate a virtual, laboratory copy of the entire system composed of potentially thousands of virtual machines, switches, and software. The use of real software, running at clock rate in full virtual machines, allows experiments to produce meaningful results without necessitating a full understanding of all model components. However, the ability to inspect and modify elements within these models is bound by the limitation that such modifications must compete with the model, either running in or alongside it. This inhibits entire classes of analyses from being conducted upon these models. We developed a mechanism to snapshot an entire emulation-based model as it is running. This allows us to "freeze time" and subsequently fork execution, replay execution, modify arbitrary parts of the model, or deeply explore the model. This snapshot includes capturing packets in transit and other input/output state along with the running virtual machines. We were able to build this system in Linux using Open vSwitch and Kernel Virtual Machines on top of Sandia's emulation platform Firewheel. This primitive opens the door to numerous subsequent analyses on models, including state space exploration, debugging distributed systems, performance optimizations, improved training environments, and improved experiment repeatability.
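The freeze/snapshot/thaw idea in the abstract can be roughly illustrated with standard tooling. The sketch below is not the Staghorn implementation; it is a minimal illustration, assuming KVM guests managed through libvirt and attached to an Open vSwitch bridge (the domain names in VM_NAMES and the bridge name br0 are hypothetical placeholders), of pausing every virtual machine at roughly the same instant, saving guest state, and recording the virtual switch's flow tables. Capturing packets actually in transit, as the report describes, requires tighter coordination with the switch and is omitted here.

#!/usr/bin/env python3
"""Minimal sketch of the freeze/snapshot/thaw idea described above.

Not the Staghorn implementation: a plain illustration using the standard
virsh (libvirt) and Open vSwitch command-line tools. The guest domain
names and the bridge name are hypothetical placeholders.
"""
import subprocess
import time

VM_NAMES = ["router-1", "server-1", "client-1"]   # hypothetical KVM guests
OVS_BRIDGE = "br0"                                 # hypothetical OVS bridge


def run(cmd):
    """Run a command, raising on failure, and return its stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def freeze_model():
    """Pause every guest so the whole model stops at (nearly) one instant."""
    for vm in VM_NAMES:
        run(["virsh", "suspend", vm])


def snapshot_model(tag):
    """Record guest state and switch flow state while the model is frozen."""
    for vm in VM_NAMES:
        # Snapshot of the paused guest (disk state, plus memory for
        # internal qcow2 snapshots).
        run(["virsh", "snapshot-create-as", vm, f"{vm}-{tag}"])
    # Forwarding state of the virtual switch. Packets in transit would also
    # need to be captured; that part of the mechanism is omitted here.
    with open(f"flows-{tag}.txt", "w") as f:
        f.write(run(["ovs-ofctl", "dump-flows", OVS_BRIDGE]))


def thaw_model():
    """Resume every guest, letting time in the model continue."""
    for vm in VM_NAMES:
        run(["virsh", "resume", vm])


if __name__ == "__main__":
    tag = time.strftime("%Y%m%d-%H%M%S")
    freeze_model()
    try:
        snapshot_model(tag)
    finally:
        thaw_model()

Forking or replaying execution would start new guests from the saved snapshots rather than resuming the originals; this sketch only shows the capture step.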

Authors:
 Gabert, Kasimir [1]; Burns, Ian [1]; Elliott, Steven [1]; Kallaher, Jenna [1]; Vail, Adam [1]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
September 2016
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1411885
Report Number(s):
SAND2016-9616
657048
DOE Contract Number:
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Gabert, Kasimir, Burns, Ian, Elliott, Steven, Kallaher, Jenna, and Vail, Adam. Staghorn: An Automated Large-Scale Distributed System Analysis Platform. United States: N. p., 2016. Web. doi:10.2172/1411885.
Gabert, Kasimir, Burns, Ian, Elliott, Steven, Kallaher, Jenna, & Vail, Adam. Staghorn: An Automated Large-Scale Distributed System Analysis Platform. United States. doi:10.2172/1411885.
Gabert, Kasimir, Burns, Ian, Elliott, Steven, Kallaher, Jenna, and Vail, Adam. 2016. "Staghorn: An Automated Large-Scale Distributed System Analysis Platform". United States. doi:10.2172/1411885. https://www.osti.gov/servlets/purl/1411885.
@article{osti_1411885,
title = {Staghorn: An Automated Large-Scale Distributed System Analysis Platform},
author = {Gabert, Kasimir and Burns, Ian and Elliott, Steven and Kallaher, Jenna and Vail, Adam},
doi = {10.2172/1411885},
place = {United States},
year = 2016,
month = 9
}

Similar Records:
  • The advent of large-scale collaborative scientific applications has demonstrated the potential for broad scientific communities to pool globally distributed resources to produce unprecedented data acquisition, movement, and analysis. System resources including supercomputers, data repositories, computing facilities, network infrastructures, storage systems, and display devices have been increasingly deployed at national laboratories and academic institutes. These resources are typically shared by large communities of users over the Internet or dedicated networks and hence exhibit an inherent dynamic nature in their availability, accessibility, capacity, and stability. Scientific applications using either experimental facilities or computation-based simulations with various physical, chemical, climatic, and biological models feature diverse scientific workflows as simple as linear pipelines or as complex as directed acyclic graphs, which must be executed and supported over wide-area networks with massively distributed resources. Application users oftentimes need to manually configure their computing tasks over networks in an ad hoc manner, significantly limiting the productivity of scientists and constraining the utilization of resources. The success of these large-scale distributed applications requires a highly adaptive and massively scalable workflow platform that provides automated and optimized computing and networking services. This project will design and develop a generic Scientific Workflow Automation and Management Platform (SWAMP), which contains a web-based user interface specially tailored for a target application, a set of user libraries, and several easy-to-use computing and networking toolkits for application scientists to conveniently assemble, execute, monitor, and control complex computing workflows in heterogeneous high-performance network environments. SWAMP will enable the automation and management of the entire process of scientific workflows with the convenience of a few mouse clicks while hiding the implementation and technical details from end users. In particular, we will consider two types of applications with distinct performance requirements: data-centric and service-centric applications. For data-centric applications, the main workflow task involves large-volume data generation, cataloging, storage, and movement, typically from supercomputers or experimental facilities to a team of geographically distributed users; for service-centric applications, the main focus of the workflow is on data archiving, preprocessing, filtering, synthesis, visualization, and other application-specific analysis. We will conduct a comprehensive comparison of existing workflow systems and choose the best-suited one with open-source code, a flexible system structure, and a large user base as the starting point for our development. Based on the chosen system, we will develop and integrate new components that are missing from existing workflow systems, including a black-box design of computing modules, performance monitoring and prediction, and workflow optimization and reconfiguration. A modular design separating specification, execution, and monitoring aspects will be adopted to establish a common generic infrastructure suited for a wide spectrum of science applications.
We will further design and develop efficient workflow mapping and scheduling algorithms to optimize workflow performance in terms of minimum end-to-end delay, maximum frame rate, and highest reliability (a brief sketch of the end-to-end-delay objective appears after this list). We will develop and demonstrate the SWAMP system in a local environment, on the grid network, and on the 100 Gbps Advanced Network Initiative (ANI) testbed. The demonstration will target scientific applications in climate modeling and high energy physics, and the functions to be demonstrated include workflow deployment, execution, steering, and reconfiguration. Throughout the project period, we will work closely with the science communities in the fields of climate modeling and high energy physics, including the Spallation Neutron Source (SNS) and Large Hadron Collider (LHC) projects, to mature the system for production use.
  • The goal of this project was to develop a tool for facilitating simulation, validation and discovery of multiscale dynamical processes in microbial ecosystems. This led to the development of an open-source software platform for Computation Of Microbial Ecosystems in Time and Space (COMETS). COMETS performs spatially distributed time-dependent flux balance based simulations of microbial metabolism. Our plan involved building the software platform itself, calibrating and testing it through comparison with experimental data, and integrating simulations and experiments to address important open questions on the evolution and dynamics of cross-feeding interactions between microbial species.
  • The effectiveness of a large-scale electric power system can be measured by four factors: system performance, system availability, system cost, and system worth (from the user perspective). In response to the need for synergistic effectiveness measures, a broad, multi-contractor research project is being conducted to integrate those four categories. This report describes system cost at two levels: a conceptual framework for measuring the total cost of producing electricity under diverse system effectiveness measures, and a set of general cost inputs that relate the framework to specific utility types. In this report, Chapter II describes the general-level conceptual framework for assessing the cost of system effectiveness attributes. Chapter III shows how the actual costs of a power system can be disaggregated and then integrated into the broad-level conceptual framework. Chapter IV utilizes the conceptual framework and the concepts underlying its development to produce some concrete examples of measures of cost of system effectiveness. Appendix A is a more in-depth look at the cost of fuel, and illustrates the level of analytical detail necessary for putting the framework into practice.
  • The Computational Structural Mechanics (CSM) activity is developing advanced structural analysis and computational methods that exploit high-performance computers. Methods are developed in the framework of the CSM testbed software system and applied to representative complex structural analysis problems from the aerospace industry. An overview of the CSM testbed methods development environment is presented and some numerical methods developed on a CRAY-2 are described. Selected application studies performed on the NAS CRAY-2 are also summarized.
  • Elements of a methodology for large scale system effectiveness analysis with application to electric energy systems are presented. The physical system consisting of generation, transmission, distribution, and end users is decomposed into the supply part and the demand part. Analytical tools for assessing the primitive attributes for each part are introduced. The random feasibility set, the single-node perturbed random feasibility set and the energy service delivery graph are alternative means for characterizing the capabilities for service delivery of the supply system. Attributes for characterizing demands such as the minimal requirement curve and the marginal availability curve are introduced and procedures for propagating these attributes through a probabilistic radial system and aggregating them are presented. Analysis of the relationships between commensurable attributes of supply and demand forms the basis for defining and evaluating high level attributes (sufficiency, efficiency, and equitability) that are introduced as components of effectiveness. This is the first technical report on a continuing project to develop a conceptual framework for effectiveness analysis.
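The SWAMP item above aims to minimize a workflow's end-to-end delay. For a DAG-structured workflow, that delay is governed by the longest (critical) path through the task graph. The sketch below uses a made-up four-task workflow with hypothetical runtime estimates and computes that delay under the simplifying assumption of unlimited parallel resources; it illustrates the objective only and is not any project's scheduling algorithm.

"""Sketch: end-to-end delay of a DAG workflow as its critical-path length.

Task names and runtimes are hypothetical; a real system would obtain them
from performance monitoring and prediction.
"""
from graphlib import TopologicalSorter

# Estimated runtime in seconds for each task (made-up values).
runtime = {"acquire": 120, "filter": 30, "simulate": 600, "visualize": 45}

# Map each task to the set of tasks that must finish before it can start.
deps = {
    "filter": {"acquire"},
    "simulate": {"acquire"},
    "visualize": {"filter", "simulate"},
}


def end_to_end_delay(runtime, deps):
    """Earliest finish time of the last task, assuming unlimited resources."""
    finish = {}
    for task in TopologicalSorter(deps).static_order():
        ready = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = ready + runtime[task]
    return max(finish.values())


print(end_to_end_delay(runtime, deps))  # 765: acquire -> simulate -> visualize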