U.S. Department of Energy
Office of Scientific and Technical Information

High Performance Computing Metrics to Enable Application-Platform Communication

Technical Report · DOI: https://doi.org/10.2172/1562429 · OSTI ID: 1562429
Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large-scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium because they are a heavily shared resource serving the varied needs of many users. To ensure that the HPC platform resources are being used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine whether resource contention can be understood and minimized through monitoring, modeling, planning, and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid in Sandia's efforts to succeed in extreme-scale computing.
Research Organization:
Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA); USDOE Laboratory Directed Research and Development (LDRD) Program
DOE Contract Number:
AC04-94AL85000
OSTI ID:
1562429
Report Number(s):
SAND--2016-9525; 647700
Country of Publication:
United States
Language:
English

Similar Records

R&D100: Lightweight Distributed Metric Service
Multimedia · Wed Nov 18 23:00:00 EST 2015 · OSTI ID:1328737

Towards New Metrics for High-Performance Computing Resilience
Conference · Sat Dec 31 23:00:00 EST 2016 · OSTI ID:1360079

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems
Journal Article · Tue May 17 20:00:00 EDT 2016 · Parallel Computing · OSTI ID:1263594
