OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark

Abstract

We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1TB size dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly access of matrices and vector computation using SIMD units. We report these results and their implications on the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.
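The CX decomposition approximates a matrix A by a small set of its actual columns C together with a coefficient matrix X, so that A ≈ CX, with columns typically sampled according to statistical leverage scores. As a minimal NumPy sketch of that idea (an illustration only, not the paper's tuned Spark or C implementations; the function name, sampling details, and parameters here are assumptions):

```python
# Sketch of a randomized CX decomposition, A ~= C @ X, where C consists of
# actual columns of A chosen by rank-k leverage-score sampling.
# Illustrative only; not the optimized implementations evaluated in the paper.
import numpy as np

def cx_decomposition(A, k, c, rng=np.random.default_rng(0)):
    """Select c columns of A by rank-k leverage scores; return C, X, and
    the sampled column indices, with A approximately equal to C @ X."""
    # Rank-k leverage scores: squared column norms of the top-k right
    # singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k, :] ** 2, axis=0)
    p = lev / lev.sum()
    # Sample c distinct column indices with probability proportional
    # to their leverage scores.
    idx = rng.choice(A.shape[1], size=c, replace=False, p=p)
    C = A[:, idx]
    # X = C^+ A is the least-squares fit of A in the span of the chosen columns.
    X = np.linalg.pinv(C) @ A
    return C, X, idx
```

For a matrix of exact rank k, sampling c ≥ k columns this way generically recovers the column space, so the reconstruction error is near zero; on real data the error degrades gracefully with c.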

Authors:
Gittens, Alex; Kottalam, Jey; Yang, Jiyan; Ringenburg, Michael F.; Chhugani, Jatin; Racah, Evan; Singh, Mohitdeep; Yao, Yushu; Fischer, Curt; Ruebel, Oliver; Bowen, Benjamin; Lewis, Norman G.; Mahoney, Michael W.; Krishnamurthy, Venkat; Prabhat
Publication Date:
July 2017
Research Org.:
Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States)
Sponsoring Org.:
Computational Research Division; National Energy Research Scientific Computing Center
OSTI Identifier:
1372901
Report Number(s):
LBNL-1005719
ir:1005719
Resource Type:
Conference
Country of Publication:
United States
Language:
English

Citation Formats

Gittens, Alex, Kottalam, Jey, Yang, Jiyan, Ringenburg, Michael F., Chhugani, Jatin, Racah, Evan, Singh, Mohitdeep, Yao, Yushu, Fischer, Curt, Ruebel, Oliver, Bowen, Benjamin, Lewis, Norman G., Mahoney, Michael W., Krishnamurthy, Venkat, and Prabhat. A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark. United States: N. p., 2017. Web. doi:10.1109/IPDPSW.2016.114.
Gittens, Alex, Kottalam, Jey, Yang, Jiyan, Ringenburg, Michael F., Chhugani, Jatin, Racah, Evan, Singh, Mohitdeep, Yao, Yushu, Fischer, Curt, Ruebel, Oliver, Bowen, Benjamin, Lewis, Norman G., Mahoney, Michael W., Krishnamurthy, Venkat, & Prabhat. A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark. United States. doi:10.1109/IPDPSW.2016.114.
Gittens, Alex, Kottalam, Jey, Yang, Jiyan, Ringenburg, Michael F., Chhugani, Jatin, Racah, Evan, Singh, Mohitdeep, Yao, Yushu, Fischer, Curt, Ruebel, Oliver, Bowen, Benjamin, Lewis, Norman G., Mahoney, Michael W., Krishnamurthy, Venkat, and Prabhat. 2017. "A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark". United States. doi:10.1109/IPDPSW.2016.114. https://www.osti.gov/servlets/purl/1372901.
@inproceedings{osti_1372901,
title = {A multi-platform evaluation of the randomized CX low-rank matrix factorization in Spark},
author = {Gittens, Alex and Kottalam, Jey and Yang, Jiyan and Ringenburg, Michael F. and Chhugani, Jatin and Racah, Evan and Singh, Mohitdeep and Yao, Yushu and Fischer, Curt and Ruebel, Oliver and Bowen, Benjamin and Lewis, Norman G. and Mahoney, Michael W. and Krishnamurthy, Venkat and Prabhat},
abstractNote = {We investigate the performance and scalability of the randomized CX low-rank matrix factorization and demonstrate its applicability through the analysis of a 1TB mass spectrometry imaging (MSI) dataset, using Apache Spark on an Amazon EC2 cluster, a Cray XC40 system, and an experimental Cray cluster. We implemented this factorization both as a parallelized C implementation with hand-tuned optimizations and in Scala using the Apache Spark high-level cluster computing framework. We obtained consistent performance across the three platforms: using Spark we were able to process the 1TB size dataset in under 30 minutes with 960 cores on all systems, with the fastest times obtained on the experimental Cray cluster. In comparison, the C implementation was 21X faster on the Amazon EC2 system, due to careful cache optimizations, bandwidth-friendly access of matrices and vector computation using SIMD units. We report these results and their implications on the hardware and software issues arising in supporting data-centric workloads in parallel and distributed environments.},
doi = {10.1109/IPDPSW.2016.114},
place = {United States},
year = {2017},
month = {7}
}


Similar Records:
  • Rank tests provide an alternative to the usual normal theory F-test for the analysis of data from randomized complete blocks experiments. Two such rank tests are the Friedman test, which employs the method of n-rankings, and the rank transformation procedure, which employs an overall ranking of the data. In this paper the asymptotic efficiency of the rank transformation procedure is developed and compared to the asymptotic efficiencies of Friedman's test and the usual F-test. These efficiencies are developed using contiguous alternatives that are shifts in location. Comparisons among the three tests are made using normal, Student, and double exponential within-block distributions. Block effects are introduced by drawing location shifts from normal and uniform distributions and, also, by drawing scale changes from an inverted gamma density. The asymptotic relative efficiencies were evaluated using numerical procedures.
  • We analyze the parallel performance of randomized interpolative decomposition by decomposing low rank complex-valued Gaussian random matrices larger than 100 GB. We chose a Cray XMT supercomputer as it provides an almost ideal PRAM model permitting quick investigation of parallel algorithms without obfuscation from hardware idiosyncrasies. We find that on non-square matrices performance scales almost linearly, with runtime about 100 times faster on 128 processors. We also verify that numerically discovered error bounds still hold on matrices two orders of magnitude larger than those previously tested.
  • This paper presents a new algorithm for computing the QR factorization of a rank-deficient matrix that is well suited for high-performance machines. These machines typically employ a memory hierarchy, and matrix-matrix operations perform better on them than matrix-vector or vector-vector operations since they require significantly less data movement per floating point operation. The traditional QR factorization algorithm with column pivoting is not well suited for such environments since it precludes the use of matrix-matrix operations. Instead, we suggest a restricted pivoting strategy based on incremental condition estimation which allows us to formulate a block QR factorization algorithm where the bulk of the work is in matrix-matrix operations. Performance results on the Cray 2, Cray X-MP and Cray Y-MP show that the new algorithm performs significantly better than the traditional scheme and can more than halve the cost of computing the QR factorization. 19 refs., 1 fig., 1 tab.
  • We study randomized techniques for designing efficient algorithms on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor is guaranteed to send and receive at most h items in any round. The measure of efficiency we use is in terms of the internal computation time of the processors and the number of communication rounds needed to solve the problem at hand. We present techniques that achieve optimal efficiency in these bounds over all possible values for p, and we call such techniques fully-scalable for this reason. In particular, we address two fundamental problems: multi-searching and convex hull construction. Our methods result in algorithms that use internal time that is O(n log n/p) and, for h = {Theta}(n/p), a number of communication rounds that is O(log n/log (h + 1)) with high probability. Both of these bounds are asymptotically optimal for the BSP model.