U.S. Department of Energy
Office of Scientific and Technical Information

Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience

Conference
  1. Pacific Northwest National Laboratory
  2. University of Utah
  3. Google
  4. NVIDIA
  5. Lawrence Livermore National Laboratory
  6. Indiana University-Bloomington
  7. University of British Columbia
Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). Consequently, one of the critical features of machine-learning accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision floating-point data formats and computation offer acceptable accuracy while substantially improving performance, area, and memory footprint. While promising, the effect of mixed-precision computation on error resilience remains unexplored. To this end, we develop MPGemmFI, a fault injection framework that systematically injects faults into mixed-precision computation results, and we investigate how those faults affect the accuracy of machine learning applications. Based on the observed error-resilience characteristics, we offer lightweight error detection and correction solutions that improve overall model accuracy by 75% when the models experience hardware faults, and that can be efficiently integrated into the accelerator's pipelines.
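To make the fault model concrete, below is a minimal Python sketch of result-level fault injection for a mixed-precision GEMM (FP16 inputs with FP32 accumulation, as on tensor cores). It is an illustration only: the function names, the single-bit-flip fault model, and the NumPy-based setup are assumptions for exposition, not the paper's actual MPGemmFI implementation.

import numpy as np

def mixed_precision_gemm(a, b):
    # FP16 inputs, FP32 accumulation: a rough software model of a tensor-core GEMM.
    return np.matmul(a.astype(np.float16).astype(np.float32),
                     b.astype(np.float16).astype(np.float32))

def inject_bit_flip(c, rng):
    # Hypothetical fault model: flip one random bit in one random FP32 element
    # of the GEMM result, mimicking a hardware fault in the computation output.
    flat = c.ravel()
    idx = int(rng.integers(flat.size))   # which output element is corrupted
    bit = int(rng.integers(32))          # which bit of its 32-bit encoding flips
    raw = flat[idx].view(np.uint32) ^ np.uint32(1 << bit)
    flat[idx] = raw.view(np.float32)
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))
golden = mixed_precision_gemm(a, b)
faulty = inject_bit_flip(golden.copy(), rng)
print("max abs deviation from one bit flip:", float(np.abs(golden - faulty).max()))

The sketch only models where a fault lands; the lightweight detection and correction schemes the abstract mentions would operate on such results inside the accelerator pipeline.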
Research Organization:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-76RL01830
OSTI ID:
2479157
Report Number(s):
PNNL-SA-183954
Country of Publication:
United States
Language:
English

Similar Records

Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications
Conference · November 2022 · OSTI ID: 1963399

Matrix Product (GEMM) Performance Data from GPUs
Dataset · September 2021 · OSTI ID: 1819195

Mixed-Precision S/DGEMM Using the TF32 and TF64 Frameworks on Low-Precision AI Tensor Cores
Conference · November 2023 · OSTI ID: 2438716
