Distributed out-of-memory NMF on CPU/GPU architectures

Boureima, Ismael; Bhattarai, Manish; Eren, Maksim; Skau, Erik; Romero, Philip; Eidenbenz, Stephan; Alexandrov, Boian

doi:10.1007/s11227-023-05587-4

Journal Article · September 8, 2023 · Journal of Supercomputing

DOI: https://doi.org/10.1007/s11227-023-05587-4 · OSTI ID: 2246858

Boureima, Ismael [1]; Bhattarai, Manish [1]; Eren, Maksim [1]; Skau, Erik [2]; Romero, Philip [3]; Eidenbenz, Stephan [2]; Alexandrov, Boian [1]

[1] Los Alamos National Laboratory (LANL), Los Alamos, NM (United States). Theoretical Division
[2] Los Alamos National Laboratory (LANL), Los Alamos, NM (United States). Computer, Computational, and Statistical Science Division
[3] Los Alamos National Laboratory (LANL), Los Alamos, NM (United States). HPC Division

We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems, where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden by using CUDA streams to overlap data transfers and computation asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced by using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show that the new GPU implementation is 32x to 76x faster than the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340-terabyte matrix and an 11-exabyte sparse matrix of density 10^-6.
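The batching idea in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' implementation: it assumes the classic Frobenius-norm multiplicative update rule (the abstract does not name the update rule), uses NumPy in place of GPU kernels, and the names `batched_nmf` and `load_batch` are hypothetical. The matrix A is split into row blocks, only one block is touched at a time, and the small k-by-n and k-by-k partial products are summed across blocks. In a distributed setting those sums are the quantities that a collective such as an all-reduce would combine.

```python
import numpy as np

def batched_nmf(load_batch, n_batches, n_cols, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize A ~= W @ H while reading A one row block at a time.

    load_batch(b) is a hypothetical callback returning the b-th row block
    of A as a non-negative 2-D array; in an out-of-memory setting it would
    read the block from disk or host memory.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((k, n_cols))
    # One W block per row block of A, each of shape (rows_in_block, k).
    W_blocks = [rng.random((load_batch(b).shape[0], k)) for b in range(n_batches)]

    for _ in range(n_iter):
        HHt = H @ H.T                      # k x k, small enough to keep resident
        WtA = np.zeros((k, n_cols))        # running sum of W_b^T A_b
        WtW = np.zeros((k, k))             # running sum of W_b^T W_b
        for b in range(n_batches):
            A_b = load_batch(b)            # only this block of A is needed now
            W_b = W_blocks[b]
            # The W update is local to the block (updated in place).
            W_b *= (A_b @ H.T) / (W_b @ HHt + eps)
            # Partial products for the H update; these are the terms that
            # would be all-reduced across workers in a distributed run.
            WtA += W_b.T @ A_b
            WtW += W_b.T @ W_b
        H *= WtA / (WtW @ H + eps)
    return W_blocks, H

# Usage on a small synthetic rank-4 matrix split into three row blocks.
rng = np.random.default_rng(1)
A = rng.random((60, 4)) @ rng.random((4, 30))
blocks = np.array_split(A, 3, axis=0)
W_blocks, H = batched_nmf(lambda b: blocks[b], n_batches=3, n_cols=30, k=4)
rel_err = np.linalg.norm(A - np.vstack(W_blocks) @ H) / np.linalg.norm(A)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because each block contributes only k-by-n and k-by-k partial sums, peak memory in this sketch depends on the block size rather than on the full row count of A. Overlapping the block loads with computation, which the abstract attributes to CUDA streams, is not shown here.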

Research Organization: Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)

Sponsoring Organization: USDOE Laboratory Directed Research and Development (LDRD) Program; USDOE National Nuclear Security Administration (NNSA), Office of Defense Nuclear Nonproliferation

Grant/Contract Number: 89233218CNA000001; AC52-06NA25396

OSTI ID: 2246858

Report Number(s): LA-UR--23-33139

Journal Information: Journal of Supercomputing, Vol. 80; ISSN 0920-8542

Publisher: Springer

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING
CUDA
CuPy
GPU
Latent features
Mathematics
NCCL
NMF
distributed processing
out of memory
parallel programming
