OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Characterization and identification of HPC applications at leadership computing facility

Abstract

High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential to run large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resource demands of applications vary. A deep understanding of the characteristics of applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation. To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O. Based on holistic insights into the applications gained through co-analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications.
We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features, and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.
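The fingerprint-and-identify pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's actual method: the feature names (MPI time fraction, I/O time fraction, per-node FLOPS, bytes written), the synthetic application classes, and the choice of a random forest classifier are all assumptions made here for demonstration, using scikit-learn's `TSNE` and `RandomForestClassifier`.

```python
# Illustrative sketch of "fingerprinting" HPC jobs from per-job log-derived
# features, embedding them with t-SNE for visual validation, and training a
# classifier to identify the application. Features and classes are synthetic.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Three hypothetical application classes, each with a distinct mean profile
# over four illustrative features:
# [mpi_time_frac, io_time_frac, flops_per_node, bytes_written]
profiles = {
    0: [0.6, 0.05, 1e12, 1e9],   # communication-heavy simulation
    1: [0.2, 0.40, 1e11, 1e11],  # I/O-heavy data analysis
    2: [0.1, 0.02, 5e12, 1e8],   # compute-bound kernel
}
X_parts, y = [], []
for label, mean in profiles.items():
    # 20 noisy jobs per class (10% relative noise around the class profile)
    X_parts.append(rng.normal(loc=mean, scale=np.abs(mean) * 0.1, size=(20, 4)))
    y += [label] * 20
X = np.vstack(X_parts)
y = np.array(y)

# Log-scale the heavy-tailed features so no single scale dominates distances.
X_scaled = np.log1p(np.abs(X))

# t-SNE: 2-D embedding to check visually whether the features separate classes.
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X_scaled)

# Supervised identification: train a classifier on the same fingerprint features.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)
print(embedding.shape, clf.score(X_scaled, y))
```

In practice the features would come from joined I/O, scheduler, and interconnect logs keyed by job ID, and the classifier would be evaluated on held-out jobs rather than the training set.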

Authors:
Liu, Zhengchun [1]; Rao, Nageswara S. [2]; Kettimuthu, Rajkumar [1]; Foster, Ian [1]; Lewis, Ryan [3]; Harms, Kevin [1]; Carns, Philip [1]; Papka, Michael [1]
  1. Argonne National Laboratory (ANL)
  2. ORNL
  3. Northern Illinois University
Publication Date:
June 2020
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1649007
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: ACM International Conference on Supercomputing (ICS20), virtual conference, June 29 to July 2, 2020
Country of Publication:
United States
Language:
English

Citation Formats

Liu, Zhengchun, Rao, Nageswara S., Kettimuthu, Rajkumar, Foster, Ian, Lewis, Ryan, Harms, Kevin, Carns, Philip, and Papka, Michael. Characterization and identification of HPC applications at leadership computing facility. United States: N. p., 2020. Web.
Liu, Zhengchun, Rao, Nageswara S., Kettimuthu, Rajkumar, Foster, Ian, Lewis, Ryan, Harms, Kevin, Carns, Philip, & Papka, Michael. Characterization and identification of HPC applications at leadership computing facility. United States.
Liu, Zhengchun, Rao, Nageswara S., Kettimuthu, Rajkumar, Foster, Ian, Lewis, Ryan, Harms, Kevin, Carns, Philip, and Papka, Michael. 2020. "Characterization and identification of HPC applications at leadership computing facility". United States. https://www.osti.gov/servlets/purl/1649007.
@article{osti_1649007,
title = {Characterization and identification of HPC applications at leadership computing facility},
author = {Liu, Zhengchun and Rao, Nageswara S. and Kettimuthu, Rajkumar and Foster, Ian and Lewis, Ryan and Harms, Kevin and Carns, Philip and Papka, Michael},
abstractNote = {High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential to run large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resource demands of applications vary. A deep understanding of the characteristics of applications that run on a leadership supercomputer is one of the most important inputs to planning its design, development, and operation. To improve our understanding of HPC applications, user demands, and resource usage characteristics, we perform correlative analysis of logs from different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which in some cases conflict with existing assumptions and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while applications spend significant time on MPI, most spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs shows that the probability of a hardware FATAL event causing task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher-level parallel I/O. Based on holistic insights into the applications gained through co-analysis of multiple logs from different perspectives, together with general intuition, we engineer features to "fingerprint" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features, and finally train machine learning models to identify HPC applications or group those with similar characteristics. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.},
url = {https://www.osti.gov/biblio/1649007},
place = {United States},
year = {2020},
month = {6}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
