U.S. Department of Energy
Office of Scientific and Technical Information

Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications

Conference
Graphics Processing Units (GPUs), the dominant accelerators in HPC systems, are susceptible to transient hardware faults. New generations of GPUs feature mixed-precision architectures, such as NVIDIA Tensor Cores, to accelerate matrix multiplications. Although these architectures are widely adopted, how they behave under transient hardware faults remains unclear. In this study, we conduct large-scale fault injection experiments on GEMM kernels implemented with different floating-point data types on the V100 and A100 Tensor Cores, and show distinct error resilience characteristics for GEMMs with different formats. In the future, we plan to explore this space by building precision-aware floating-point fault tolerance techniques for applications such as DNNs that exercise low-precision computations.
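The idea of a bit-flip fault injection into a GEMM can be illustrated with a minimal CPU-side sketch in NumPy. This is not the tooling or methodology used in the study, which targets Tensor Core kernels on the V100 and A100. The helper names `flip_bit` and `gemm_error_after_flip`, the choice to corrupt a single input element, and the float64 accumulation are all assumptions made for illustration only.

```python
import numpy as np

# Unsigned integer type with the same width as each floating-point format.
_UINT_FOR_SIZE = {2: np.uint16, 4: np.uint32, 8: np.uint64}


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of a NumPy floating-point scalar."""
    uint_type = _UINT_FOR_SIZE[value.dtype.itemsize]
    raw = np.array([value]).view(uint_type)  # reinterpret the bits, no conversion
    raw ^= uint_type(1) << uint_type(bit)
    return raw.view(value.dtype)[0]


def gemm_error_after_flip(a, b, row, col, bit):
    """Largest absolute output deviation after one bit flip in a[row, col].

    Hypothetical helper: the product is accumulated in float64 so that the
    only difference between the two runs is the injected fault.
    """
    faulty = a.copy()
    faulty[row, col] = flip_bit(faulty[row, col], bit)
    golden = a.astype(np.float64) @ b.astype(np.float64)
    with np.errstate(all="ignore"):  # a flip may produce Inf or NaN
        corrupted = faulty.astype(np.float64) @ b.astype(np.float64)
        return float(np.max(np.abs(corrupted - golden)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for dtype in (np.float16, np.float32):
        a = rng.standard_normal((8, 8)).astype(dtype)
        b = rng.standard_normal((8, 8)).astype(dtype)
        top_bit = a.dtype.itemsize * 8 - 1  # sign bit of this format
        for bit in (0, top_bit - 1, top_bit):
            err = gemm_error_after_flip(a, b, 0, 0, bit)
            print(f"{np.dtype(dtype).name} bit {bit}: max abs error = {err}")
```

The same bit position has a different meaning in each format (sign, exponent, or mantissa), which is one reason the impact of a single flip can differ between data types.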
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
1963399
Report Number(s):
PNNL-SA-177005
Country of Publication:
United States
Language:
English

Similar Records

Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience
Conference · Sat Sep 28 00:00:00 EDT 2024 · OSTI ID:2479157

Matrix Product (GEMM) Performance Data from GPUs
Dataset · Thu Sep 09 00:00:00 EDT 2021 · OSTI ID:1819195

Throughput-Oriented and Accuracy-Aware DNN Training with BFloat16 on GPU
Conference · Fri Dec 31 23:00:00 EST 2021 · OSTI ID:1888020
