U.S. Department of Energy
Office of Scientific and Technical Information

Production Application Performance Data Streaming for System Monitoring

Journal Article · ACM Transactions on Modeling and Performance Evaluation of Computing Systems
DOI: https://doi.org/10.1145/3319498 · OSTI ID: 1570222
 [1];  [2];  [1];  [2]
  1. Univ. of Central Florida, Orlando, FL (United States)
  2. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. 
We leverage LDMS to collect system-level performance data and explore the correlation between system and application events. Finally, we demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system equipped with Intel Xeon Phi processors and another equipped with Intel Xeon processors. Our overhead study shows that our method imposes at most 0.5% CPU-usage overhead on the application in realistic deployment scenarios.
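The abstract describes a two-part design: an application-side profiler that accumulates event counters, and a separate sampler that periodically extracts those counters over inter-process communication and streams them onward (to LDMS, in the paper). The sketch below illustrates that split with a minimal, self-contained Python example using `multiprocessing.shared_memory` as the IPC mechanism. All names here, and the choice of shared memory, are illustrative assumptions for exposition; they are not the authors' implementation, which wraps MPI and integrates with LDMS.

```python
# Sketch of the profiler/sampler split described in the abstract (assumed
# design for illustration): a "profiler" process bumps event counters in
# shared memory, while a "sampler" process snapshots them on a fixed period
# without interrupting the application.
import time
from multiprocessing import Process, shared_memory

N_COUNTERS = 4  # e.g., counts of MPI_Send, MPI_Recv, MPI_Bcast, MPI_Allreduce

def profiler(shm_name, iterations):
    """Stand-in for the application-side profiler: one increment per event."""
    shm = shared_memory.SharedMemory(name=shm_name)
    counters = shm.buf.cast("Q")          # view the buffer as uint64 slots
    for i in range(iterations):
        counters[i % N_COUNTERS] += 1     # pretend an MPI event just occurred
        time.sleep(0.001)                 # pretend to do application work
    counters.release()
    shm.close()

def sampler(shm_name, period_s, samples):
    """Periodically snapshot the counters and 'stream' them (here: print)."""
    shm = shared_memory.SharedMemory(name=shm_name)
    counters = shm.buf.cast("Q")
    for _ in range(samples):
        time.sleep(period_s)
        snapshot = [counters[i] for i in range(N_COUNTERS)]
        print("sample:", snapshot)        # a real sampler would send to LDMS
    counters.release()
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * N_COUNTERS)
    p = Process(target=profiler, args=(shm.name, 200))
    s = Process(target=sampler, args=(shm.name, 0.05, 3))
    p.start(); s.start()
    p.join(); s.join()
    shm.close()
    shm.unlink()
```

Because the sampler only reads a small shared buffer on its own schedule, the application pays no file-system I/O cost and no cost proportional to the trace length, which is the property the abstract's low-overhead claim depends on.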

Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1570222
Report Number(s):
SAND2019-8609J; 677802
Journal Information:
ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Vol. 4, Issue 2; ISSN 2376-3639
Publisher:
Association for Computing Machinery
Country of Publication:
United States
Language:
English


Similar Records

LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs
Technical Report · September 2020 · OSTI ID: 1813665

A Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis
Technical Report · March 2005 · OSTI ID: 841192

AUTOPERF
Software · April 2021 · OSTI ID: code-62475