OSTI.GOV
U.S. Department of Energy, Office of Scientific and Technical Information

Title: Global Experiences with HPC Operational Data Measurement, Collection and Analysis

Abstract

As we move into the exascale era, supercomputers grow larger, denser, more heterogeneous, and ever more complex. Operating such machines reliably and efficiently requires deep insight into the operational parameters of the machine itself as well as its supporting infrastructure. To fulfill this need, early adopter sites have started the development and deployment of Operational Data Analytics (ODA) frameworks allowing the continuous monitoring, archiving, and analysis of near realtime performance data from the machine and infrastructure levels, providing immediately actionable information for multiple operational uses. To understand their ODA goals, requirements, and use cases, we have conducted a survey among eight early adopter sites from the US, Europe, and Japan that operate top 50 high-performance computing systems. We have assessed the technologies leveraged to build their ODA frameworks, identified use cases and other push and pull factors that drive the sites' ODA activities, and report on their operational lessons.

Authors:
 Ott, Michael [1]; Shin, Woong [2]; Bourassa, Norman J. [3]; Wilde, Tosten [4]; Ceballos, Stefan [2]; Romanus, Melissa [3]; Bates, Natalie [5]
  1. Leibniz Supercomputing Centre
  2. ORNL
  3. Lawrence Berkeley National Laboratory (LBNL)
  4. Hewlett Packard Enterprise
  5. Energy Efficient HPC Working Group
Publication Date:
September 2020
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
OSTI Identifier:
1706258
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Energy Efficient HPC State of the Practice Workshop 2020, Kobe, Japan, 9/14/2020
Country of Publication:
United States
Language:
English

Citation Formats

Ott, Michael, Shin, Woong, Bourassa, Norman J., Wilde, Tosten, Ceballos, Stefan, Romanus, Melissa, and Bates, Natalie. Global Experiences with HPC Operational Data Measurement, Collection and Analysis. United States: N. p., 2020. Web.
Ott, Michael, Shin, Woong, Bourassa, Norman J., Wilde, Tosten, Ceballos, Stefan, Romanus, Melissa, & Bates, Natalie. Global Experiences with HPC Operational Data Measurement, Collection and Analysis. United States.
Ott, Michael, Shin, Woong, Bourassa, Norman J., Wilde, Tosten, Ceballos, Stefan, Romanus, Melissa, and Bates, Natalie. 2020. "Global Experiences with HPC Operational Data Measurement, Collection and Analysis". United States. https://www.osti.gov/servlets/purl/1706258.
@article{osti_1706258,
title = {Global Experiences with HPC Operational Data Measurement, Collection and Analysis},
author = {Ott, Michael and Shin, Woong and Bourassa, Norman J. and Wilde, Tosten and Ceballos, Stefan and Romanus, Melissa and Bates, Natalie},
abstractNote = {As we move into the exascale era, supercomputers grow larger, denser, more heterogeneous, and ever more complex. Operating such machines reliably and efficiently requires deep insight into the operational parameters of the machine itself as well as its supporting infrastructure. To fulfill this need, early adopter sites have started the development and deployment of Operational Data Analytics (ODA) frameworks allowing the continuous monitoring, archiving, and analysis of near realtime performance data from the machine and infrastructure levels, providing immediately actionable information for multiple operational uses. To understand their ODA goals, requirements, and use cases, we have conducted a survey among eight early adopter sites from the US, Europe, and Japan that operate top 50 high-performance computing systems. We have assessed the technologies leveraged to build their ODA frameworks, identified use cases and other push and pull factors that drive the sites' ODA activities, and report on their operational lessons.},
url = {https://www.osti.gov/biblio/1706258},
place = {United States},
year = {2020},
month = {9}
}

