Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications
- Battelle (Pacific Northwest National Laboratory)
- NVIDIA
- University of Utah
- Lawrence Livermore National Laboratory
Graphics Processing Units (GPUs), the dominant accelerators in HPC systems, are susceptible to transient hardware faults. New generations of GPUs feature mixed-precision architectures, such as NVIDIA Tensor Cores, to accelerate matrix multiplication. While these units are widely adopted, how they behave under transient hardware faults remains unclear. In this study, we conduct large-scale fault injection experiments on GEMM kernels implemented with different floating-point data types on V100 and A100 Tensor Cores, and show that GEMMs with different formats exhibit distinct error resilience characteristics. In the future, we plan to explore this space by building precision-aware floating-point fault tolerance techniques for applications, such as DNNs, that exercise low-precision computations.
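To make the experimental setup concrete, here is a minimal CPU-side sketch, assuming only NumPy: it flips each bit position of a single GEMM input element in turn, reruns the multiplication, and counts how many positions corrupt the output beyond a threshold, once for FP16 and once for FP32. The flip_bit helper, matrix size, injection site, and corruption threshold are illustrative assumptions, not the instrumentation or methodology used in the study, which injects faults on actual V100 and A100 Tensor Cores.

```python
# Hypothetical bit-flip fault injection into a GEMM input, swept over bit
# positions and floating-point formats. All names and parameters below are
# illustrative assumptions, not the study's tooling.
import numpy as np

def flip_bit(mat: np.ndarray, row: int, col: int, bit: int) -> np.ndarray:
    """Return a copy of `mat` with bit `bit` flipped in element (row, col)."""
    uint_type = {2: np.uint16, 4: np.uint32, 8: np.uint64}[mat.dtype.itemsize]
    out = mat.copy()
    bits = out.view(uint_type)             # reinterpret the buffer as integers
    bits[row, col] ^= uint_type(1 << bit)  # XOR toggles the chosen bit
    return out

rng = np.random.default_rng(0)
n = 256
a32 = rng.standard_normal((n, n), dtype=np.float32)
b32 = rng.standard_normal((n, n), dtype=np.float32)

for dtype in (np.float16, np.float32):
    a, b = a32.astype(dtype), b32.astype(dtype)
    golden = (a @ b).astype(np.float64)    # fault-free reference product
    tol = 1e-2 * np.max(np.abs(golden))    # illustrative corruption threshold
    n_bits = a.dtype.itemsize * 8
    corrupted = 0
    for bit in range(n_bits):              # one injection per bit position
        faulty = (flip_bit(a, n // 2, n // 2, bit) @ b).astype(np.float64)
        if not np.all(np.isfinite(faulty)) or np.max(np.abs(faulty - golden)) > tol:
            corrupted += 1
    print(f"{np.dtype(dtype).name}: {corrupted}/{n_bits} bit positions "
          f"corrupted the output beyond the threshold")
```

Flips in high exponent bits generally cause the largest deviations, and because the exponent field occupies a different share of each format's bits, a sweep like this tends to yield format-dependent corruption counts, the kind of distinct resilience behavior the abstract reports for Tensor Core GEMMs.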
- Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Organization: USDOE
- DOE Contract Number: AC05-76RL01830
- OSTI ID: 1963399
- Report Number(s): PNNL-SA-177005
- Country of Publication: United States
- Language: English
Similar Records
- Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience (Conference · September 28, 2024 · OSTI ID: 2479157)
- Matrix Product (GEMM) Performance Data from GPUs (Dataset · September 9, 2021 · OSTI ID: 1819195)
- Throughput-Oriented and Accuracy-Aware DNN Training with BFloat16 on GPU (Conference · December 31, 2021 · OSTI ID: 1888020)