Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Gittens, Alex; Devarakonda, Aditya; Racah, Evan; Ringenburg, Michael; Gerhardt, Lisa; Kottalam, Jey; Liu, Jialin; Maschhoff, Kristyn; Canon, Shane; Chhugani, Jatin; Sharma, Pramod; Yang, Jiyan; Demmel, James; Harrell, Jim; Krishnamurthy, Venkat; Mahoney, Michael; Prabhat, Mr

Conference · Thu May 12 00:00:00 EDT 2016

OSTI ID:1332132

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to 1.6 TB of particle physics data, 2.2 TB and 16 TB of climate modeling data, and 1.1 TB of bioimaging data. The data matrices are tall and skinny, which enables the algorithms to map conveniently onto Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
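
To illustrate why a tall-and-skinny shape suits a data-parallel model, here is a minimal sketch in plain Python. It is not Spark code and not the authors' implementation. It assumes the matrix is split into row partitions: each partition computes a small n-by-n partial Gram matrix, and the partials are summed in a reduce step. The function names and the toy 6-by-2 matrix are hypothetical.

```python
from functools import reduce


def partial_gram(rows):
    """Map step: sum of outer products r^T r over the rows of one partition."""
    n = len(rows[0])
    g = [[0.0] * n for _ in range(n)]
    for r in rows:
        for i in range(n):
            for j in range(n):
                g[i][j] += r[i] * r[j]
    return g


def add_matrices(x, y):
    """Reduce step: element-wise sum of two n x n matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def gram_by_partitions(partitions):
    """Compute A^T A for a matrix A stored as a list of row partitions."""
    return reduce(add_matrices, map(partial_gram, partitions))


# Hypothetical toy data: a 6 x 2 matrix split into three row partitions.
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]

gram = gram_by_partitions(partitions)
print(gram)
```

Only n-by-n matrices cross partition boundaries, so when n is small the reduce step stays cheap regardless of the number of rows. In a PCA-style computation the small result could then be factored on a single node; that step is omitted here.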

Research Organization: Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)

Sponsoring Organization: National Energy Research Scientific Computing Division

OSTI ID: 1332132

Report Number(s): LBNL-1006428; ir:1006428

Resource Relation: Conference: 2016 IEEE International Conference on Big Data, Washington DC, USA, 05/12/2016

Country of Publication: United States

Language: English

Similar Records

A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark

Conference · Thu Jul 27 00:00:00 EDT 2017 · OSTI ID:1332132

Gittens, Alex; Kottalam, Jey; Yang, Jiyan; +9 more

HPC Global File System Performance Analysis Using A Scientific-Application Derived Benchmark

Journal Article · Thu Aug 28 00:00:00 EDT 2008 · Parallel Computing Systems&Applications · OSTI ID:1332132

Borrill, Julian; Oliker, Leonid; Shalf, John; +2 more

Improving MPI Collective I/O for High Volume Non-Contiguous Requests With Intra-Node Aggregation

Journal Article · Fri Jun 05 00:00:00 EDT 2020 · IEEE Transactions on Parallel and Distributed Systems · OSTI ID:1332132

Kang, Qiao; Lee, Sunwoo; Hou, Kaiyuan; +4 more
