skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: OVIS 3.2 user's guide.

Abstract

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 3.2. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect meaningful conditions and anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 3.2 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It provides an interactive 3-D physicalmore » view in which the cluster elements can be colored by raw or derived element values (e.g., temperatures, memory errors). The visual display allows the user to easily determine abnormal or outlier behaviors. Additionally, it provides search capabilities for certain scheduler logs. The OVIS capabilities were designed to be highly interactive - for example, the job search may drive an analysis which in turn may drive the user generation of a derived value which would then be examined on the physical display. The OVIS project envisions the capabilities of its tools applied to compute cluster monitoring. In the future, integration with the scheduler or resource manager will be included in a release to enable intelligent resource utilization. For example, nodes that are deemed less healthy (i.e., nodes that exhibit outlier behavior with respect to some set of variables shown to be correlated with future failure) can be discovered and assigned to shorter duration or less important jobs. Further, HPC applications with fault-tolerant capabilities would respond to changes in resource health and other OVIS notifications as needed, rather than undertaking preventative measures (e.g. checkpointing) at regular intervals unnecessarily.« less

Authors:
; ; ; ; ; ; ;
Publication Date:
Research Org.:
Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1010855
Report Number(s):
SAND2010-7109
TRN: US201109%%5
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICAL METHODS AND COMPUTING; O CODES; MANUALS; COMPUTERS; MONITORING; PERFORMANCE; USES

Citation Formats

Mayo, Jackson R, Gentile, Ann C, Brandt, James M, Houf, Catherine A, Thompson, David C, Roe, Diana C, Wong, Matthew H, and Pebay, Philippe Pierre. OVIS 3.2 user's guide.. United States: N. p., 2010. Web. doi:10.2172/1010855.
Mayo, Jackson R, Gentile, Ann C, Brandt, James M, Houf, Catherine A, Thompson, David C, Roe, Diana C, Wong, Matthew H, & Pebay, Philippe Pierre. OVIS 3.2 user's guide.. United States. https://doi.org/10.2172/1010855
Mayo, Jackson R, Gentile, Ann C, Brandt, James M, Houf, Catherine A, Thompson, David C, Roe, Diana C, Wong, Matthew H, and Pebay, Philippe Pierre. 2010. "OVIS 3.2 user's guide.". United States. https://doi.org/10.2172/1010855. https://www.osti.gov/servlets/purl/1010855.
@article{osti_1010855,
title = {OVIS 3.2 user's guide.},
author = {Mayo, Jackson R and Gentile, Ann C and Brandt, James M and Houf, Catherine A and Thompson, David C and Roe, Diana C and Wong, Matthew H and Pebay, Philippe Pierre},
abstractNote = {This document describes how to obtain, install, use, and enjoy a better life with OVIS version 3.2. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect meaningful conditions and anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 3.2 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It provides an interactive 3-D physical view in which the cluster elements can be colored by raw or derived element values (e.g., temperatures, memory errors). The visual display allows the user to easily determine abnormal or outlier behaviors. Additionally, it provides search capabilities for certain scheduler logs. The OVIS capabilities were designed to be highly interactive - for example, the job search may drive an analysis which in turn may drive the user generation of a derived value which would then be examined on the physical display. The OVIS project envisions the capabilities of its tools applied to compute cluster monitoring. In the future, integration with the scheduler or resource manager will be included in a release to enable intelligent resource utilization. For example, nodes that are deemed less healthy (i.e., nodes that exhibit outlier behavior with respect to some set of variables shown to be correlated with future failure) can be discovered and assigned to shorter duration or less important jobs. Further, HPC applications with fault-tolerant capabilities would respond to changes in resource health and other OVIS notifications as needed, rather than undertaking preventative measures (e.g. checkpointing) at regular intervals unnecessarily.},
doi = {10.2172/1010855},
url = {https://www.osti.gov/biblio/1010855}, journal = {},
number = ,
volume = ,
place = {United States},
year = {Fri Oct 01 00:00:00 EDT 2010},
month = {Fri Oct 01 00:00:00 EDT 2010}
}