U.S. Department of Energy
Office of Scientific and Technical Information

Understanding large scale HPC systems through scalable monitoring and analysis.

Conference · OSTI ID: 1028363
As HPC systems grow in size and complexity, diagnosing problems and understanding system behavior, including failure modes, becomes increasingly difficult and time-consuming. At Sandia National Laboratories we have developed a tool, OVIS, to facilitate large-scale HPC system understanding. OVIS incorporates an intuitive graphical user interface, an extensive and extendable data analysis suite, and a 3-D visualization engine that allows visual inspection of both raw and derived data on a geometrically correct representation of an HPC system. This talk will cover system instrumentation, data collection (including log files and the complications of meaningful parsing), analysis, visualization of both raw and derived information, and how data can be combined to increase system understanding and efficiency.
Research Organization: Sandia National Laboratories
Sponsoring Organization: USDOE
DOE Contract Number: AC04-94AL85000
OSTI ID: 1028363
Report Number(s): SAND2010-6158C
Country of Publication: United States
Language: English

Similar Records

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems
Journal Article · Tue May 17 20:00:00 EDT 2016 · Parallel Computing · OSTI ID:1263594

Integrated System and Application Continuous Performance Monitoring and Analysis Capability
Technical Report · Wed Sep 01 00:00:00 EDT 2021 · OSTI ID:1819812

Integrated System and Application Continuous Performance Monitoring and Analysis Capability (Final)
Technical Report · Wed Sep 01 00:00:00 EDT 2021 · OSTI ID:1822583