OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: TAUOVERSUPERMON: LOW-OVERHEAD ONLINE PARALLEL PERFORMANCE MONITORING

Abstract

Online, or real-time, application performance monitoring tracks performance characteristics during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which can be useful for long-running applications. Two fundamental components of such a performance monitor are the measurement and transport systems. The former captures performance metrics of individual contexts (processes, threads). The latter enables querying the parallel/distributed state from the different contexts and also allows measurement control. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. We adapt and combine two existing, mature systems, Tuning and Analysis Utilities (TAU) and Supermon, to address this problem. TAU performs the measurement, while Supermon collects the distributed measurement state. Our experiments show that this novel approach of using a cluster monitor, Supermon, as the transport for online performance data from TAU leads to very low-overhead application monitoring, as well as other benefits unavailable with a traditional transport such as NFS.

Authors:
Sottile, Matthew Joseph [1]; Nataraj, Aroon [1]; Malony, Allen [1]; Morris, Alan [1]; Shende, Sameer [1]
  1. Los Alamos National Laboratory
Publication Date:
2007-01-30
Research Org.:
Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
OSTI Identifier:
985893
Report Number(s):
LA-UR-07-0662
TRN: US201017%%71
DOE Contract Number:
AC52-06NA25396
Resource Type:
Conference
Resource Relation:
Conference: EUROPAR 2007; August 2007; Rennes
Country of Publication:
United States
Language:
English
Subject:
99; METRICS; MONITORING; MONITORS; PERFORMANCE; TRANSPORT; TUNING

Citation Formats

SOTTILE, MATTHEW JOSEPH, NATARAJ, AROON, MALONY, ALLEN, MORRIS, ALAN, and SHENDE, SAMEER. TAUOVERSUPERMON: LOW-OVERHEAD ONLINE PARALLEL PERFORMANCE MONITORING. United States: N. p., 2007. Web.
SOTTILE, MATTHEW JOSEPH, NATARAJ, AROON, MALONY, ALLEN, MORRIS, ALAN, & SHENDE, SAMEER. TAUOVERSUPERMON: LOW-OVERHEAD ONLINE PARALLEL PERFORMANCE MONITORING. United States.
SOTTILE, MATTHEW JOSEPH, NATARAJ, AROON, MALONY, ALLEN, MORRIS, ALAN, and SHENDE, SAMEER. 2007. "TAUOVERSUPERMON: LOW-OVERHEAD ONLINE PARALLEL PERFORMANCE MONITORING". United States. https://www.osti.gov/servlets/purl/985893.
@inproceedings{osti_985893,
title = {TAUOVERSUPERMON: LOW-OVERHEAD ONLINE PARALLEL PERFORMANCE MONITORING},
author = {SOTTILE, MATTHEW JOSEPH and NATARAJ, AROON and MALONY, ALLEN and MORRIS, ALAN and SHENDE, SAMEER},
abstractNote = {Online, or real-time, application performance monitoring tracks performance characteristics during execution rather than post-mortem. This opens up possibilities that are otherwise unavailable, such as real-time visualization and application performance steering, which can be useful for long-running applications. Two fundamental components of such a performance monitor are the measurement and transport systems. The former captures performance metrics of individual contexts (processes, threads). The latter enables querying the parallel/distributed state from the different contexts and also allows measurement control. As HPC systems grow in size and complexity, the key challenge is to keep the online performance monitor scalable and low-overhead while still providing a useful performance reporting capability. We adapt and combine two existing, mature systems, Tuning and Analysis Utilities (TAU) and Supermon, to address this problem. TAU performs the measurement, while Supermon collects the distributed measurement state. Our experiments show that this novel approach of using a cluster monitor, Supermon, as the transport for online performance data from TAU leads to very low-overhead application monitoring, as well as other benefits unavailable with a traditional transport such as NFS.},
booktitle = {EUROPAR 2007},
place = {United States},
url = {https://www.osti.gov/servlets/purl/985893},
year = {2007},
month = jan
}


  • To achieve exaFLOPS performance within a contained power budget, next-generation supercomputers will feature hundreds of millions of components operating at low and near-threshold voltage. As the probability that at least one of these components fails during the execution of an application approaches certainty, it seems unrealistic to expect that any run of a scientific application will not experience some performance faults. We believe that there is a need for a new generation of lightweight performance and debugging tools that can be used online, even during production runs of parallel applications, and that can identify performance anomalies during application execution. In this work we propose the design and implementation of a monitoring system that continuously inspects the evolution of run ...
  • The CMS experiment at the LHC features over 2'500 devices that need constant monitoring in order to ensure proper data taking. The monitoring solution has been migrated from Nagios to Icinga, with several useful plugins. The motivations behind the migration and the selection of the plugins are discussed.
  • A set of reactive chemical transport calculations was conducted with the Subsurface Transport Over Reactive Multi-phases (STORM) code to evaluate the long-term performance of a representative low-activity waste glass in a shallow subsurface disposal system located on the Hanford Site. Technetium, the main contributor to a drinking water dose, is assumed to be released congruently with the dissolution of the glass. Sodium is released at a higher rate via a kinetic ion-exchange reaction. Aqueous equilibrium reactions involving sodium and other dissolved glass constituents increase the pH, and hence the rate of glass dissolution. The precipitation of secondary minerals, such as herschelite, can also lower the amount of aqueous dissolved silica, which can increase the rate of glass dissolution. Predicted technetium release rates, however, still remain several orders of magnitude lower than required by drinking water regulations.
  • By the end of this decade, hundreds of millions of dollars will be expended annually on the ongoing monitoring and verification (OMV) of compliance with significant multilateral non-proliferation, arms control and disarmament (NACD) agreements and obligations. These multilateral NACD agreements will play a crucial role in the maintenance and enhancement of international security at both a regional and global level. These treaties have in common provisions for monitoring and verification relying on some form of on-site inspection (OSI) as a central mechanism. They virtually ignore the use of overhead imaging, which is an affordable, effective -- and flexible -- means of data collection. The myth exists in the minds of negotiators that overhead remote sensing is too complicated and expensive to be used for NACD purposes. Commercially available space-based and airborne remote sensing are not competing but complementary approaches for monitoring purposes, and both are affordable. The paper uses as a model the United Nations Special Commission (UNSCOM), which employs both and has been described in a new United Nations study as a veritable "verification laboratory". The paper concludes that significant opportunities exist to employ overhead sensors as a major NACD monitoring and verification asset, with the result of increased effectiveness and decreased overall costs. 1 fig.