OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: HPC formulations of optimization algorithms for tensor completion

Abstract

Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.
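To make the setting concrete, the sketch below illustrates the kind of computation the abstract describes: SGD applied to rank-R CP (CANDECOMP/PARAFAC) factorization of a sparsely observed 3-way tensor. This is a minimal serial illustration, not the paper's parallel formulation; the function name, hyperparameters, and update schedule are illustrative assumptions.

```python
import numpy as np

def sgd_cp_completion(entries, shape, rank=8, lr=0.02, reg=0.001,
                      epochs=100, seed=0):
    """Illustrative sketch: SGD for rank-`rank` CP tensor completion.

    entries: list of (i, j, k, value) observed cells of a 3-way tensor.
    Returns factor matrices A, B, C such that
        T[i, j, k] ~= sum_r A[i, r] * B[j, r] * C[k, r].
    """
    rng = np.random.default_rng(seed)
    A = 0.1 * rng.standard_normal((shape[0], rank))
    B = 0.1 * rng.standard_normal((shape[1], rank))
    C = 0.1 * rng.standard_normal((shape[2], rank))
    for _ in range(epochs):
        # Visit the observed cells in a random order each epoch.
        for idx in rng.permutation(len(entries)):
            i, j, k, v = entries[idx]
            pred = np.dot(A[i] * B[j], C[k])   # CP model prediction
            err = v - pred
            a, b, c = A[i].copy(), B[j].copy(), C[k].copy()
            # Gradient steps with L2 regularization on each factor row.
            A[i] += lr * (err * b * c - reg * a)
            B[j] += lr * (err * a * c - reg * b)
            C[k] += lr * (err * a * b - reg * c)
    return A, B, C
```

In the stratified parallel variant studied in the paper, disjoint blocks of such updates (touching disjoint rows of A, B, and C) can run concurrently without conflicts; the serial loop above shows only the per-entry update rule.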

Publication Date:
May 2018
Research Org.:
Lawrence Berkeley National Laboratory-National Energy Research Scientific Computing Center
Sponsoring Org.:
USDOE Office of Science (SC)
OSTI Identifier:
1478749
DOE Contract Number:  
AC02-05CH11231
Resource Type:
Journal Article
Journal Name:
Parallel Computing
Additional Journal Information:
Journal Volume: 74; Journal Issue: C; Journal ID: ISSN 0167-8191
Country of Publication:
United States
Language:
English

Citation Formats

HPC formulations of optimization algorithms for tensor completion. United States: N. p., 2018. Web. doi:10.1016/j.parco.2017.11.002.
HPC formulations of optimization algorithms for tensor completion. United States. doi:10.1016/j.parco.2017.11.002.
"HPC formulations of optimization algorithms for tensor completion". United States, 2018. doi:10.1016/j.parco.2017.11.002.
@article{osti_1478749,
title = {HPC formulations of optimization algorithms for tensor completion},
author = {None},
abstractNote = {Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.},
doi = {10.1016/j.parco.2017.11.002},
journal = {Parallel Computing},
issn = {0167-8191},
number = {C},
volume = {74},
place = {United States},
year = {2018},
month = {5}
}