DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Distributed non-negative matrix factorization with determination of the number of latent features

Abstract

The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is prominent among these methods. NMF is based on compute-intensive, non-convex constrained minimization, which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF do not estimate the number of latent features. In practice, identifying these features is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. Here, we introduce DnMFk, a distributed NMF algorithm coupled with distributed custom clustering and followed by a stability analysis on dense data, to determine the number of latent variables. Results on synthetic data and the classical Swimmer data set demonstrate accurate model determination with nearly linear scaling across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte matrix.
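
The idea summarized in the abstract (factorize, cluster the resulting features, then judge stability) can be illustrated with a small single-node sketch. This is not the DnMFk implementation: it is not distributed, and several pieces are assumptions made for illustration only, namely the multiplicative-update solver, the multiplicative uniform noise used to perturb the data, the greedy column matching used as a stand-in for the custom clustering, and all function names and parameter values.

```python
import numpy as np

def nmf(X, k, rng, iters=300, eps=1e-9):
    """Multiplicative updates (Lee and Seung) for X ~= W @ H with W, H >= 0."""
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def unit_columns(W, eps=1e-12):
    """Scale every column to unit Euclidean length."""
    return W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)

def align_to_reference(W, ref):
    """Greedily match each column of W to one distinct column of ref by
    cosine similarity (assumed stand-in for the custom clustering step)."""
    sim = unit_columns(ref).T @ unit_columns(W)  # sim[i, j]: ref col i vs W col j
    k = sim.shape[0]
    perm = np.empty(k, dtype=int)
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        perm[i] = j
        sim[i, :] = -np.inf  # row i and column j are now taken
        sim[:, j] = -np.inf
    return W[:, perm]

def min_silhouette(points, labels):
    """Smallest silhouette width over all columns of points (cosine distance)."""
    P = unit_columns(points)
    D = 1.0 - P.T @ P
    clusters = np.unique(labels)
    s = np.empty(len(labels))
    for idx in range(len(labels)):
        same = labels == labels[idx]
        same[idx] = False  # exclude the point itself
        a = D[idx, same].mean() if same.any() else 0.0
        b = min(D[idx, labels == c].mean() for c in clusters if c != labels[idx])
        s[idx] = (b - a) / max(a, b, 1e-12)
    return s.min()

def stability_scan(X, k_values, n_runs=8, noise=0.03, seed=0):
    """For each k (k >= 2), factorize perturbed copies of X and report
    (minimum silhouette of the grouped features, mean relative error)."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in k_values:
        cols, labels, errs, ref = [], [], [], None
        for _ in range(n_runs):
            # Assumed perturbation: multiplicative uniform noise on each entry.
            Xp = X * rng.uniform(1 - noise, 1 + noise, size=X.shape)
            W, H = nmf(Xp, k, rng)
            errs.append(np.linalg.norm(X - W @ H) / np.linalg.norm(X))
            ref = W if ref is None else ref
            cols.append(align_to_reference(W, ref))
            labels.extend(range(k))
        sil = min_silhouette(np.hstack(cols), np.array(labels))
        results[k] = (float(sil), float(np.mean(errs)))
    return results

# Synthetic non-negative data built from 3 latent features.
data_rng = np.random.default_rng(1)
X = data_rng.random((40, 3)) @ data_rng.random((3, 60))
scan = stability_scan(X, k_values=range(2, 6))
for k, (sil, err) in scan.items():
    print(k, round(sil, 3), round(err, 4))
```

In a scan like this, one would typically look for the largest k at which the minimum silhouette stays high while the reconstruction error is already low; where to draw that line is a judgment call that the sketch leaves to the reader.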

Authors:
Chennupati, Gopinath [1]; Vangara, Raviteja [1]; Skau, Erik West [1]; Djidjev, Hristo Nikolov [1]; Alexandrov, Boian [1]
  1. Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
Publication Date:
2020-02-08
Research Org.:
Los Alamos National Laboratory (LANL), Los Alamos, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA); USDOE Laboratory Directed Research and Development (LDRD) Program
OSTI Identifier:
1688789
Report Number(s):
LA-UR-20-20469
Journal ID: ISSN 0920-8542
Grant/Contract Number:  
89233218CNA000001; AC52-06NA25396; 20190020DR
Resource Type:
Accepted Manuscript
Journal Name:
Journal of Supercomputing
Additional Journal Information:
Journal Volume: 76; Journal Issue: 9; Journal ID: ISSN 0920-8542
Publisher:
Springer
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING; NMF; latent features; distributed processing; clustering; parallel programming; silhouette; big data

Citation Formats

Chennupati, Gopinath, Vangara, Raviteja, Skau, Erik West, Djidjev, Hristo Nikolov, and Alexandrov, Boian. Distributed non-negative matrix factorization with determination of the number of latent features. United States: N. p., 2020. Web. doi:10.1007/s11227-020-03181-6.
Chennupati, Gopinath, Vangara, Raviteja, Skau, Erik West, Djidjev, Hristo Nikolov, & Alexandrov, Boian. Distributed non-negative matrix factorization with determination of the number of latent features. United States. https://doi.org/10.1007/s11227-020-03181-6
Chennupati, Gopinath, Vangara, Raviteja, Skau, Erik West, Djidjev, Hristo Nikolov, and Alexandrov, Boian. 2020. "Distributed non-negative matrix factorization with determination of the number of latent features". United States. https://doi.org/10.1007/s11227-020-03181-6. https://www.osti.gov/servlets/purl/1688789.
@article{osti_1688789,
title = {Distributed non-negative matrix factorization with determination of the number of latent features},
author = {Chennupati, Gopinath and Vangara, Raviteja and Skau, Erik West and Djidjev, Hristo Nikolov and Alexandrov, Boian},
abstractNote = {The holistic analysis and understanding of the latent (that is, not directly observable) variables and patterns buried in large datasets is crucial for data-driven science, decision making and emergency response. Such exploratory analyses require unsupervised learning methods for data mining and extraction of the latent features, and non-negative matrix factorization (NMF) is prominent among these methods. NMF is based on compute-intensive, non-convex constrained minimization, which, for large datasets, requires fast and distributed algorithms. However, current parallel implementations of NMF do not estimate the number of latent features. In practice, identifying these features is both difficult and significant for pattern recognition and latent feature analysis, especially for large dense matrices. Here, we introduce DnMFk, a distributed NMF algorithm coupled with distributed custom clustering and followed by a stability analysis on dense data, to determine the number of latent variables. Results on synthetic data and the classical Swimmer data set demonstrate accurate model determination with nearly linear scaling across multiple processors for large data. Further, we employ DnMFk to determine the number of hidden features in a terabyte matrix.},
doi = {10.1007/s11227-020-03181-6},
journal = {Journal of Supercomputing},
number = 9,
volume = 76,
place = {United States},
year = {2020},
month = {2}
}
