Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems
Abstract
A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and with HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions, and because typical monitoring tools do not capture system resource usage information with high enough fidelity to give sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.
- Authors:
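- Agelastos, Anthony; Allan, Benjamin; Brandt, Jim; Gentile, Ann; Lefantzi, Sophia; Monk, Steve; Ogden, Jeff; Rajan, Mahesh; Stevenson, Joel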
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Publication Date:
- 2016-05-18
- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)
- OSTI Identifier:
- 1263594
- Alternate Identifier(s):
- OSTI ID: 1397979
- Report Number(s):
- SAND-2016-3360J
- Journal ID: ISSN 0167-8191; PII: S0167819116300394
- Grant/Contract Number:
- AC04-94AL85000
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Parallel Computing
- Additional Journal Information:
- Journal Name: Parallel Computing; Journal ID: ISSN 0167-8191
- Publisher:
- Elsevier
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; HPC monitoring; system profiling; application profiling; resource utilization scoring
Citation Formats
Agelastos, Anthony, Allan, Benjamin, Brandt, Jim, Gentile, Ann, Lefantzi, Sophia, Monk, Steve, Ogden, Jeff, Rajan, Mahesh, and Stevenson, Joel. Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems. United States: N. p., 2016. Web. doi:10.1016/j.parco.2016.05.009.
Agelastos, Anthony, Allan, Benjamin, Brandt, Jim, Gentile, Ann, Lefantzi, Sophia, Monk, Steve, Ogden, Jeff, Rajan, Mahesh, & Stevenson, Joel. Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems. United States. https://doi.org/10.1016/j.parco.2016.05.009
Agelastos, Anthony, Allan, Benjamin, Brandt, Jim, Gentile, Ann, Lefantzi, Sophia, Monk, Steve, Ogden, Jeff, Rajan, Mahesh, and Stevenson, Joel. 2016. "Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems". United States. https://doi.org/10.1016/j.parco.2016.05.009. https://www.osti.gov/servlets/purl/1263594.
@article{osti_1263594,
title = {Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems},
author = {Agelastos, Anthony and Allan, Benjamin and Brandt, Jim and Gentile, Ann and Lefantzi, Sophia and Monk, Steve and Ogden, Jeff and Rajan, Mahesh and Stevenson, Joel},
abstractNote = {A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and with HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions, and because typical monitoring tools do not capture system resource usage information with high enough fidelity to give sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.},
doi = {10.1016/j.parco.2016.05.009},
journal = {Parallel Computing},
place = {United States},
year = {2016},
month = {5}
}