Collecting sensor data in a high-performance computing environment: a case-study
Many research questions remain open with regard to improving reliability in exascale systems. Statistics-based analysis, among other approaches, has been used to find anomalies, isolate root causes, and attempt to predict failures. However, well-understood methods and best practices for collecting reliability data in a uniform way are still lacking, which impedes analysis. We report our experience with collecting these data from heterogeneous sources on a testbed cluster and present our data collection tool. This case illustrates that the metrics reported depend largely on the individual system configuration. We then investigate standards and specifications in manufacturing and desktop computing to identify concepts that may be useful for representing High Performance Computing (HPC) data, and we present a taxonomy that uses these concepts.
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- Work for Others (WFO)
- DOE Contract Number:
- DE-AC05-00OR22725
- OSTI ID:
- 986794
- Resource Relation:
- Conference: WORLDCOMP 2010 - PDPTA, Las Vegas, NV, USA, 2010-07-12 to 2010-07-15
- Country of Publication:
- United States
- Language:
- English