Understanding large scale HPC systems through scalable monitoring and analysis.
As HPC systems grow in size and complexity, diagnosing problems and understanding system behavior, including failure modes, becomes increasingly difficult and time-consuming. At Sandia National Laboratories we have developed a tool, OVIS, to facilitate understanding of large-scale HPC systems. OVIS incorporates an intuitive graphical user interface, an extensive and extensible data analysis suite, and a 3-D visualization engine that allows visual inspection of both raw and derived data on a geometrically correct representation of an HPC system. This talk will cover system instrumentation, data collection (including log files and the complications of parsing them meaningfully), analysis, visualization of both raw and derived information, and how data can be combined to increase system understanding and efficiency.
- Research Organization:
- Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1028363
- Report Number(s):
- SAND2010-6158C; TRN: US201122%%231
- Resource Relation:
- Conference: Proposed for presentation at the European Grid Initiative Technical Forum held September 13-17, 2010 in Amsterdam, Netherlands
- Country of Publication:
- United States
- Language:
- English