Production Application Performance Data Streaming for System Monitoring
Abstract
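In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. Finally, we demonstrate how our tool-set can help detect anomalies with a low latency. We run tests on two different architectures: a system enabled with Intel Xeon Phi and another system equipped with Intel Xeon processor. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.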
- Authors:
- Izadpanah, Ramin; Allan, Benjamin A.; Dechev, Damian; Brandt, Jim
- Univ. of Central Florida, Orlando, FL (United States)
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Publication Date:
- 2019-06
- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Sandia National Lab. (SNL-CA), Livermore, CA (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1570222
- Report Number(s):
- SAND2019-8609J
- Journal ID: ISSN 2376-3639; 677802
- Grant/Contract Number:
- AC04-94AL85000
- Resource Type:
- Accepted Manuscript
- Journal Name:
- ACM Transactions on Modeling and Performance Evaluation of Computing Systems
- Additional Journal Information:
- Journal Volume: 4; Journal Issue: 2; Journal ID: ISSN 2376-3639
- Publisher:
- Association for Computing Machinery
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 59 BASIC BIOLOGICAL SCIENCES; application and system monitoring; performance data streaming; application profiling
Citation Formats
Izadpanah, Ramin, Allan, Benjamin A., Dechev, Damian, and Brandt, Jim. Production Application Performance Data Streaming for System Monitoring. United States: N. p., 2019. Web. doi:10.1145/3319498.
Izadpanah, Ramin, Allan, Benjamin A., Dechev, Damian, & Brandt, Jim. Production Application Performance Data Streaming for System Monitoring. United States. https://doi.org/10.1145/3319498
Izadpanah, Ramin, Allan, Benjamin A., Dechev, Damian, and Brandt, Jim. Sat . "Production Application Performance Data Streaming for System Monitoring". United States. https://doi.org/10.1145/3319498. https://www.osti.gov/servlets/purl/1570222.
@article{osti_1570222,
title = {Production Application Performance Data Streaming for System Monitoring},
author = {Izadpanah, Ramin and Allan, Benjamin A. and Dechev, Damian and Brandt, Jim},
abstractNote = {In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. 
We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. Finally, we demonstrate how our tool-set can help detect anomalies with a low latency. We run tests on two different architectures: a system enabled with Intel Xeon Phi and another system equipped with Intel Xeon processor. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.},
doi = {10.1145/3319498},
journal = {ACM Transactions on Modeling and Performance Evaluation of Computing Systems},
number = 2,
volume = 4,
place = {United States},
year = {2019},
month = {6}
}