U.S. Department of Energy
Office of Scientific and Technical Information

Production Application Performance Data Streaming for System Monitoring

Journal Article · ACM Transactions on Modeling and Performance Evaluation of Computing Systems
DOI: https://doi.org/10.1145/3319498 · OSTI ID: 1570222
 [1];  [2];  [1];  [2]
  1. Univ. of Central Florida, Orlando, FL (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. 
We leverage LDMS to collect system-level performance data and explore the correlation between system and application events. Finally, we demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system equipped with Intel Xeon Phi processors and another equipped with Intel Xeon processors. Our overhead study shows that our method imposes at most 0.5% CPU-usage overhead on the application in realistic deployment scenarios.
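The abstract describes a two-part design: an application-side profiler that accumulates event counters, and a separate sampler that periodically extracts those counters over inter-process communication and streams them onward (to LDMS, in the paper). The sketch below illustrates that split with a minimal, self-contained Python example using `multiprocessing.shared_memory` as the IPC mechanism. All names here, and the choice of shared memory, are illustrative assumptions for exposition; they are not the authors' implementation, which wraps MPI and integrates with LDMS.

```python
# Sketch of the profiler/sampler split described in the abstract (assumed
# design for illustration): a "profiler" process bumps event counters in
# shared memory, while a "sampler" process snapshots them on a fixed period
# without interrupting the application.
import time
from multiprocessing import Process, shared_memory

N_COUNTERS = 4  # e.g., counts of MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce

def profiler(shm_name, iterations):
    """Stand-in for the application-side profiler: one increment per event."""
    shm = shared_memory.SharedMemory(name=shm_name)
    counters = shm.buf.cast("Q")          # view the buffer as uint64 slots
    for i in range(iterations):
        counters[i % N_COUNTERS] += 1     # pretend an MPI event just occurred
        time.sleep(0.001)                 # pretend to do application work
    counters.release()
    shm.close()

def sampler(shm_name, period_s, samples):
    """Periodically snapshot the counters and 'stream' them (here: print)."""
    shm = shared_memory.SharedMemory(name=shm_name)
    counters = shm.buf.cast("Q")
    for _ in range(samples):
        time.sleep(period_s)
        snapshot = [counters[i] for i in range(N_COUNTERS)]
        print("sample:", snapshot)        # a real sampler would send to LDMS
    counters.release()
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * N_COUNTERS)
    p = Process(target=profiler, args=(shm.name, 200))
    s = Process(target=sampler, args=(shm.name, 0.05, 3))
    p.start(); s.start()
    p.join(); s.join()
    shm.close()
    shm.unlink()
```

Because the sampler only reads a small shared buffer on its own schedule, the application pays no file-system I/O cost and no cost proportional to the trace length, which is the property the abstract's low-overhead claim depends on.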

Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1570222
Report Number(s):
SAND2019-8609J; 677802
Journal Information:
ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Vol. 4, Issue 2; ISSN 2376-3639
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English


Similar Records

LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs
Technical Report · September 2020 · OSTI ID: 1813665

A Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis
Technical Report · March 2005 · OSTI ID: 841192

AUTOPERF
Software · April 2021 · OSTI ID: code-62475