DOE PAGES: U.S. Department of Energy, Office of Scientific and Technical Information

This content will become publicly available on June 1, 2020

Title: Production Application Performance Data Streaming for System Monitoring

Abstract

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as the Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS), a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low-overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge.
We leverage LDMS to collect system-level performance data and explore the correlation between system and application events. Finally, we demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system equipped with Intel Xeon Phi processors and another equipped with Intel Xeon processors. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.
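The core mechanism the abstract describes — an application-side profiler that increments event counters, and a separate sampler that periodically extracts them over inter-process communication without blocking the application — can be sketched with shared memory. This is a hypothetical illustration, not the authors' actual tool-set: the counter names, layout, and `create_counters`/`bump`/`sample` helpers are invented for this sketch, and the real implementation integrates with LDMS rather than printing snapshots.

```python
# Hypothetical sketch of the counter/sampler split described in the abstract.
# The application process bumps per-event counters in a shared-memory block
# (e.g. from MPI profiling-interface wrappers); a separate sampler process
# snapshots the block periodically and streams it to a collector such as LDMS.
import struct
from multiprocessing import shared_memory

NUM_COUNTERS = 4  # illustrative: e.g. MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce

def create_counters(name="app_counters"):
    """Application side: allocate a zeroed shared-memory counter block."""
    shm = shared_memory.SharedMemory(name=name, create=True,
                                     size=NUM_COUNTERS * 8)
    shm.buf[:] = bytes(NUM_COUNTERS * 8)  # zero all 64-bit counters
    return shm

def bump(shm, idx):
    """Application side: increment counter idx (one event observed)."""
    off = idx * 8
    val = struct.unpack_from("<Q", shm.buf, off)[0]
    struct.pack_into("<Q", shm.buf, off, val + 1)

def sample(shm):
    """Sampler side: snapshot all counters for streaming downstream."""
    return list(struct.unpack_from(f"<{NUM_COUNTERS}Q", shm.buf, 0))

# Usage: the application bumps counters as events occur; the sampler
# reads a consistent-enough snapshot on its own schedule.
shm = create_counters()
for _ in range(3):
    bump(shm, 0)      # three events of type 0
bump(shm, 2)          # one event of type 2
print(sample(shm))    # [3, 0, 1, 0]
shm.close()
shm.unlink()
```

The key design point the abstract emphasizes is that the sampler, not the application, pays the cost of extraction: the application only performs cheap in-memory increments, which is consistent with the reported overhead of at most 0.5% CPU usage.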

Authors:
 Izadpanah, Ramin [1]; Allan, Benjamin A. [2]; Dechev, Damian [1]; Brandt, Jim [2]
  1. Univ. of Central Florida, Orlando, FL (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
June 2019
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1570222
Report Number(s):
SAND2019-8609J
Journal ID: ISSN 2376-3639; 677802
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Accepted Manuscript
Journal Name:
ACM Transactions on Modeling and Performance Evaluation of Computing Systems
Additional Journal Information:
Journal Volume: 4; Journal Issue: 2; Journal ID: ISSN 2376-3639
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English
Subject:
59 BASIC BIOLOGICAL SCIENCES; application and system monitoring; performance data streaming; application profiling

Citation Formats

Izadpanah, Ramin, Allan, Benjamin A., Dechev, Damian, and Brandt, Jim. Production Application Performance Data Streaming for System Monitoring. United States: N. p., 2019. Web. doi:10.1145/3319498.
Izadpanah, Ramin, Allan, Benjamin A., Dechev, Damian, & Brandt, Jim. Production Application Performance Data Streaming for System Monitoring. United States. doi:10.1145/3319498.
Izadpanah, Ramin, Allan, Benjamin A., Dechev, Damian, and Brandt, Jim. 2019. "Production Application Performance Data Streaming for System Monitoring". United States. doi:10.1145/3319498.
@article{osti_1570222,
title = {Production Application Performance Data Streaming for System Monitoring},
author = {Izadpanah, Ramin and Allan, Benjamin A. and Dechev, Damian and Brandt, Jim},
abstractNote = {In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. 
We leverage LDMS to collect system-level performance data and explore the correlation between system and application events. Finally, we demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system equipped with Intel Xeon Phi processors and another equipped with Intel Xeon processors. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.},
doi = {10.1145/3319498},
journal = {ACM Transactions on Modeling and Performance Evaluation of Computing Systems},
number = 2,
volume = 4,
place = {United States},
year = {2019},
month = {6}
}

Journal Article:
Free Publicly Available Full Text
Publisher's Version of Record

