TAUOVERSUPERMON: LOW-OVERHEAD ONLINE PARALLEL PERFORMANCE MONITORING
- Los Alamos National Laboratory
Online, or real-time, application performance monitoring tracks performance characteristics during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which can be useful for long-running applications. Two fundamental components of such a performance monitor are the measurement system and the transport system. The former captures performance metrics of individual contexts (processes, threads). The latter enables querying the parallel/distributed state from the different contexts and also allows measurement control. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. We adapt and combine two existing, mature systems, Tuning and Analysis Utilities (TAU) and Supermon, to address this problem. TAU performs the measurement, while Supermon is used to collect the distributed measurement state. Our experiments show that this novel approach of using a cluster monitor, Supermon, as the transport for online performance data from TAU leads to very low-overhead application monitoring, as well as other benefits unavailable with a traditional transport such as NFS.
- Research Organization:
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- DOE Contract Number:
- AC52-06NA25396
- OSTI ID:
- 985893
- Report Number(s):
- LA-UR-07-0662; TRN: US201017%%71
- Resource Relation:
- Conference: EUROPAR 2007 ; 200708 ; RENNES
- Country of Publication:
- United States
- Language:
- English