Understanding large scale HPC systems through scalable monitoring and analysis.
Conference · OSTI ID:1028363
As HPC systems grow in size and complexity, diagnosing problems and understanding system behavior, including failure modes, becomes increasingly difficult and time-consuming. At Sandia National Laboratories we have developed a tool, OVIS, to facilitate understanding of large-scale HPC systems. OVIS incorporates an intuitive graphical user interface, an extensive and extensible data analysis suite, and a 3-D visualization engine that allows visual inspection of both raw and derived data on a geometrically correct representation of an HPC system. This talk will cover system instrumentation, data collection (including log files and the complications of meaningful parsing), analysis, visualization of both raw and derived information, and how data can be combined to increase system understanding and efficiency.
- Research Organization: Sandia National Laboratories
- Sponsoring Organization: USDOE
- DOE Contract Number: AC04-94AL85000
- OSTI ID: 1028363
- Report Number(s): SAND2010-6158C
- Country of Publication: United States
- Language: English
Similar Records
Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems
Journal Article · Tue May 17 20:00:00 EDT 2016 · Parallel Computing · OSTI ID:1263594
Integrated System and Application Continuous Performance Monitoring and Analysis Capability
Technical Report · Wed Sep 01 00:00:00 EDT 2021 · OSTI ID:1819812
Integrated System and Application Continuous Performance Monitoring and Analysis Capability (Final)
Technical Report · Wed Sep 01 00:00:00 EDT 2021 · OSTI ID:1822583