Distributed non-negative matrix factorization with determination of the number of latent features
Abstract
The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is one of the most prominent of these methods. NMF is based on compute-intensive, non-convex constrained minimization, which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF do not estimate the number of latent features. In practice, identifying this number is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. Here, we introduce a distributed NMF algorithm for dense data, coupled with distributed custom clustering and followed by a stability analysis, which we call DnMFk, to determine the number of latent variables. The results on synthetic data and the classical Swimmer data set demonstrate the accuracy of model determination while scaling nearly linearly across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte matrix.
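The idea in the abstract (factorize, cluster the resulting features, then judge stability) can be illustrated with a minimal single-node sketch. It is not the distributed DnMFk implementation. The Frobenius objective, the multiplicative updates, the uniform multiplicative noise used to perturb the data, the greedy column matching that stands in for the custom clustering, and all function and parameter names below are assumptions made for illustration only.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero

def nmf(X, k, iters=300, seed=0):
    """Frobenius-norm NMF via multiplicative updates (assumed solver): X ~ W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + 1e-3
    H = rng.random((k, X.shape[1])) + 1e-3
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
        W *= (X @ H.T) / (W @ (H @ H.T) + EPS)
    return W, H

def rel_error(X, W, H):
    """Relative reconstruction error ||X - W H||_F / ||X||_F."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

def align(ref, cand):
    """Greedily match each unit-norm column of cand to a distinct column of ref."""
    sim = ref.T @ cand  # cosine similarities, shape (k, k)
    labels = np.empty(sim.shape[0], dtype=int)
    for _ in range(sim.shape[0]):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        labels[j] = i            # cand column j joins the cluster of ref column i
        sim[i, :] = -np.inf      # each ref column is used once
        sim[:, j] = -np.inf      # each cand column is assigned once
    return labels

def silhouette(points, labels):
    """Mean silhouette width under cosine distance; points are unit-norm columns."""
    D = np.clip(1.0 - points.T @ points, 0.0, None)
    scores = []
    for i in range(D.shape[0]):
        same = labels == labels[i]
        same[i] = False          # exclude the point itself
        a = D[i, same].mean()    # mean distance within its own cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b, EPS))
    return float(np.mean(scores))

def stability(X, k, runs=8, noise=0.02, seed=0):
    """Return (mean silhouette, mean relative error) over perturbed runs; needs k >= 2, runs >= 2."""
    rng = np.random.default_rng(seed)
    cols, labels, errs = [], [], []
    for r in range(runs):
        Xp = X * rng.uniform(1 - noise, 1 + noise, size=X.shape)  # perturbed copy
        W, H = nmf(Xp, k, seed=seed + r + 1)
        errs.append(rel_error(X, W, H))
        Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + EPS)
        labels.append(np.arange(k) if r == 0 else align(cols[0], Wn))
        cols.append(Wn)
    return silhouette(np.hstack(cols), np.concatenate(labels)), float(np.mean(errs))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((40, 3)) @ rng.random((3, 60))  # synthetic data with 3 latent features
    for k in range(2, 6):
        s, e = stability(X, k)
        print(f"k={k}  silhouette={s:.3f}  rel_error={e:.3f}")
```

Under these assumptions, a reasonable reading of the output is to prefer the largest k whose features remain tightly clustered (high silhouette) while the reconstruction error is low. The exact selection criterion used by DnMFk is not stated in this record.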
- Authors:
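- Chennupati, Gopinath; Vangara, Raviteja; Skau, Erik West; Djidjev, Hristo Nikolov; Alexandrov, Boian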
- Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
- Publication Date:
- 2020-02-08
- Research Org.:
- Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA); USDOE Laboratory Directed Research and Development (LDRD) Program
- OSTI Identifier:
- 1688789
- Report Number(s):
- LA-UR-20-20469; Journal ID: ISSN 0920-8542
- Grant/Contract Number:
- 89233218CNA000001; AC52-06NA25396; 20190020DR
- Resource Type:
- Accepted Manuscript
- Journal Name:
- Journal of Supercomputing
- Additional Journal Information:
- Journal Volume: 76; Journal Issue: 9; Journal ID: ISSN 0920-8542
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 97 MATHEMATICS AND COMPUTING; NMF; latent features; distributed processing; clustering; parallel programming; silhouette; big data
Citation Formats
Chennupati, Gopinath, Vangara, Raviteja, Skau, Erik West, Djidjev, Hristo Nikolov, and Alexandrov, Boian. Distributed non-negative matrix factorization with determination of the number of latent features. United States: N. p., 2020. Web. doi:10.1007/s11227-020-03181-6.
Chennupati, Gopinath, Vangara, Raviteja, Skau, Erik West, Djidjev, Hristo Nikolov, & Alexandrov, Boian. Distributed non-negative matrix factorization with determination of the number of latent features. United States. https://doi.org/10.1007/s11227-020-03181-6
Chennupati, Gopinath, Vangara, Raviteja, Skau, Erik West, Djidjev, Hristo Nikolov, and Alexandrov, Boian. 2020. "Distributed non-negative matrix factorization with determination of the number of latent features". United States. https://doi.org/10.1007/s11227-020-03181-6. https://www.osti.gov/servlets/purl/1688789.
@article{osti_1688789,
title = {Distributed non-negative matrix factorization with determination of the number of latent features},
author = {Chennupati, Gopinath and Vangara, Raviteja and Skau, Erik West and Djidjev, Hristo Nikolov and Alexandrov, Boian},
abstractNote = {The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making and emergency response. Such exploratory analyses require devising unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is one of the prominent such methods. NMF is based on compute-intense non-convex constrained minimization, which, for large datasets requires fast and distributed algorithms. However, current parallel implementations of NMF fail to estimate the number of latent features. In practice, identifying these features is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. Here, we introduce a distributed NMF algorithm coupled with distributed custom clustering followed by a stability analysis on dense data, which we call DnMFk, to determine the number of latent variables. The results on synthetic data and the classical Swimmer data set demonstrate the accuracy of model determination while scaling nearly linearly across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features from a terabyte matrix.},
doi = {10.1007/s11227-020-03181-6},
journal = {Journal of Supercomputing},
number = 9,
volume = 76,
place = {United States},
year = {2020},
month = feb
}