Distributed out-of-memory NMF on CPU/GPU architectures
Journal Article · Journal of Supercomputing
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States). Theoretical Division
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States). Computer, Computational, and Statistical Science Division
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States). HPC Division
We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems, where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden by using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced by using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show speedups of 32x to 76x for the new GPU implementation over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340-terabyte matrix and an 11-exabyte sparse matrix of density 10⁻⁶.
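The batching idea mentioned in the abstract can be illustrated with a small sketch. This is not the NMFk implementation: it assumes the standard Frobenius-norm multiplicative updates, a row-wise split of the input matrix, and plain Python lists in place of GPU buffers. The names `batched_nmf`, `batch_rows`, and the helper functions are hypothetical. The point it shows is that only one row batch of the matrix (and the matching rows of W) has to be resident at a time, while the small k-by-n and k-by-k products needed for the H update are accumulated across batches.

```python
import random

EPS = 1e-9  # guards against division by zero in the multiplicative updates


def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    Y_cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in Y_cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def add_inplace(acc, X):
    """acc += X, element-wise."""
    for acc_row, x_row in zip(acc, X):
        for j, x in enumerate(x_row):
            acc_row[j] += x


def frobenius_error(batches, W_batches, H):
    """||A - W H||_F, computed one row batch at a time."""
    total = 0.0
    for A_b, W_b in zip(batches, W_batches):
        R = matmul(W_b, H)
        total += sum((a - r) ** 2 for ra, rr in zip(A_b, R) for a, r in zip(ra, rr))
    return total ** 0.5


def batched_nmf(A, k, batch_rows, iters=50, seed=0):
    """Row-batched NMF sketch (assumed scheme, not the paper's code)."""
    rng = random.Random(seed)
    n = len(A[0])
    # In an out-of-memory setting each batch would be streamed to the device;
    # here the batches are simply slices of an in-memory list.
    batches = [A[i:i + batch_rows] for i in range(0, len(A), batch_rows)]
    W_batches = [[[rng.random() + 0.1 for _ in range(k)] for _ in A_b]
                 for A_b in batches]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    errors = [frobenius_error(batches, W_batches, H)]
    for _ in range(iters):
        # W update is local to each batch: W_b *= (A_b H^T) / (W_b H H^T).
        Ht = transpose(H)
        HHt = matmul(H, Ht)
        for A_b, W_b in zip(batches, W_batches):
            num = matmul(A_b, Ht)
            den = matmul(W_b, HHt)
            for i in range(len(W_b)):
                for j in range(k):
                    W_b[i][j] *= num[i][j] / (den[i][j] + EPS)
        # H update needs W^T A and W^T W, accumulated over all batches.
        WtA = [[0.0] * n for _ in range(k)]
        WtW = [[0.0] * k for _ in range(k)]
        for A_b, W_b in zip(batches, W_batches):
            Wt_b = transpose(W_b)
            add_inplace(WtA, matmul(Wt_b, A_b))
            add_inplace(WtW, matmul(Wt_b, W_b))
        den = matmul(WtW, H)
        for i in range(k):
            for j in range(n):
                H[i][j] *= WtA[i][j] / (den[i][j] + EPS)
        errors.append(frobenius_error(batches, W_batches, H))
    return W_batches, H, errors
```

In a multi-GPU setting, the per-batch accumulation of `WtA` and `WtW` is the step that would map onto a sum-reduction across devices, and the per-batch loop is where host-to-device copies could be overlapped with compute; both are left out of this single-process sketch.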
- Research Organization:
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
- Sponsoring Organization:
- USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE National Nuclear Security Administration (NNSA), Office of Defense Nuclear Nonproliferation
- Grant/Contract Number:
- 89233218CNA000001; AC52-06NA25396
- OSTI ID:
- 2246858
- Report Number(s):
- LA-UR--23-33139
- Journal Information:
- Journal Name: Journal of Supercomputing; Vol. 80; ISSN 0920-8542
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
Similar Records
High performance sparse multifrontal solvers on modern GPUs
Journal Article · Fri Feb 04 19:00:00 EST 2022 · Parallel Computing · OSTI ID:1960514
A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems
Journal Article · Sun Aug 18 20:00:00 EDT 2019 · Journal of Parallel and Distributed Computing · OSTI ID:1559632
Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments
Technical Report · Mon Apr 02 00:00:00 EDT 2018 · OSTI ID:1435688