Climbing the Summit and Pushing the Frontier of Mixed Precision Benchmarks at Extreme Scale

Conference

The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed-precision arithmetic, in contrast to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores the optimization of mixed-precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and guidelines for the optimization strategies employed to deliver portable performance on both AMD and NVIDIA GPUs at extreme scale.
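
The core technique HPL-AI measures is mixed-precision iterative refinement: factor the matrix once in low precision, then recover double-precision accuracy with inexpensive FP64 residual corrections. Below is a minimal illustrative sketch of that idea, not the paper's implementation; it assumes NumPy/SciPy, uses FP32 as the low precision (standing in for the FP16 tensor-core path the benchmark exploits, which SciPy's LU does not provide), and the function name `mixed_precision_solve` is hypothetical.

```python
# Sketch of mixed-precision iterative refinement (illustrative only).
# The O(n^3) factorization runs in low precision; each refinement step
# costs only an O(n^2) FP64 residual plus a low-precision triangular solve.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, max_iters=50, tol=1e-12):
    # Factor in single precision (HPL-AI uses FP16 on tensor cores here).
    lu, piv = lu_factor(A.astype(np.float32))
    # Initial low-precision solution, promoted to FP64.
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iters):
        r = b - A @ x  # residual computed in full FP64
        if np.linalg.norm(r, np.inf) <= tol * np.linalg.norm(b, np.inf):
            break
        # Correction reuses the existing low-precision factors.
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b, np.inf))  # residual near FP64 accuracy
```

As in the benchmark, convergence to FP64-level accuracy depends on the matrix being well conditioned, which is why HPL-AI generates a diagonally dominant system.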

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE; USDOE Office of Science (SC)
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1997799
Resource Relation:
Conference: International Conference for High Performance Computing, Networking, Storage and Analysis (SC22), Dallas, TX, United States of America, November 13-18, 2022
Country of Publication:
United States
Language:
English
