# HPC formulations of optimization algorithms for tensor completion

## Abstract

Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.
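The record itself contains no code, but the ALS approach the abstract names can be illustrated with a minimal serial sketch: fit a low-rank CP factorization to only the observed entries of a 3-way tensor, updating one factor matrix at a time by solving small regularized least-squares problems. Everything here (the `als_complete` helper, the dense synthetic tensor, the regularization constant) is a hypothetical simplification for illustration, not the paper's parallel, sparse implementation.

```python
import numpy as np

def als_complete(shape, idx, vals, rank=2, iters=30, reg=1e-3, seed=0):
    """Sketch of CP tensor completion via alternating least squares (ALS):
    fit rank-`rank` factors A, B, C to the observed entries of a 3-way tensor.
    `idx` is a tuple (ii, jj, kk) of observed coordinates, `vals` their values."""
    rng = np.random.default_rng(seed)
    I, J, K = shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    ii, jj, kk = idx
    eye = reg * np.eye(rank)
    for _ in range(iters):
        # Update one factor at a time, holding the other two fixed.
        for F, rows, G1, c1, G2, c2 in ((A, ii, B, jj, C, kk),
                                        (B, jj, A, ii, C, kk),
                                        (C, kk, A, ii, B, jj)):
            # Khatri-Rao product rows, restricted to observed entries.
            H = G1[c1] * G2[c2]
            for r in range(F.shape[0]):
                sel = rows == r
                Hr = H[sel]
                if Hr.size == 0:
                    continue  # no observed entries touch this slice
                # Regularized rank-by-rank normal equations for row r.
                F[r] = np.linalg.solve(Hr.T @ Hr + eye, Hr.T @ vals[sel])
    return A, B, C

# Demo on a synthetic rank-2 tensor with roughly 70% of entries observed.
shape = (6, 5, 4)
rng = np.random.default_rng(1)
full = np.einsum('ir,jr,kr->ijk',
                 rng.standard_normal((6, 2)),
                 rng.standard_normal((5, 2)),
                 rng.standard_normal((4, 2)))
obs = rng.random(shape) < 0.7
ii, jj, kk = np.nonzero(obs)
vals = full[ii, jj, kk]
A, B, C = als_complete(shape, (ii, jj, kk), vals)
pred = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm(pred[obs] - vals) / np.linalg.norm(vals)
print(f"relative error on observed entries: {rel_err:.2e}")
```

In the paper's HPC formulations, the per-row least-squares solves in each mode update are the natural unit of parallel work; the serial loop above only shows the structure of the optimization.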

- Publication Date:
- May 2018

- Research Org.:
- Lawrence Berkeley National Laboratory-National Energy Research Scientific Computing Center

- Sponsoring Org.:
- USDOE Office of Science (SC)

- OSTI Identifier:
- 1478749

- DOE Contract Number:
- AC02-05CH11231

- Resource Type:
- Journal Article

- Journal Name:
- Parallel Computing

- Additional Journal Information:
- Journal Volume: 74; Journal Issue: C; Journal ID: ISSN 0167-8191

- Country of Publication:
- United States

- Language:
- English

### Citation Formats

*HPC formulations of optimization algorithms for tensor completion*. United States: N. p., 2018. Web. doi:10.1016/j.parco.2017.11.002.

*HPC formulations of optimization algorithms for tensor completion*. United States. doi:10.1016/j.parco.2017.11.002.

"HPC formulations of optimization algorithms for tensor completion". United States. doi:10.1016/j.parco.2017.11.002.

```
@article{osti_1478749,
  title        = {HPC formulations of optimization algorithms for tensor completion},
  author       = {None},
  abstractNote = {Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.},
  doi          = {10.1016/j.parco.2017.11.002},
  journal      = {Parallel Computing},
  issn         = {0167-8191},
  number       = {C},
  volume       = {74},
  place        = {United States},
  year         = {2018},
  month        = {5}
}
```