U.S. Department of Energy
Office of Scientific and Technical Information

Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study

GPU memory corruption, and double-bit errors (DBEs) in particular, remains one of the least understood aspects of HPC system reliability. Although rare, DBEs always lead to job termination and can cost thousands of node-hours, either through wasted computation or through the overhead of the regular checkpointing needed to limit such losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge. We examine DBEs using system telemetry data and logs collected from the Summit supercomputer, which is equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We find that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. In the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no significant correlation with higher temperatures. We also show that the workload type can be a factor in memory’s propensity for corruption.
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21); USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
2378092
Resource Type:
Conference paper/presentation
Conference Information:
ICS 2024: ACM International Conference on Supercomputing - Kyoto, Japan - 6/4/2024-6/7/2024
Country of Publication:
United States
Language:
English

Similar Records

OLCF Summit Supercomputer GPU Snapshots During Double-Bit Errors and Normal Operations
Dataset · Thu Apr 20 00:00:00 EDT 2023 · OSTI ID:1970187

GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs
Journal Article · Thu Mar 31 20:00:00 EDT 2022 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1960228

Pre-exascale accelerated application development: The ORNL Summit experience
Journal Article · Thu Apr 30 20:00:00 EDT 2020 · IBM Journal of Research and Development · OSTI ID:1649509
