Programming with BIG data in R: Scaling analytics from one to thousands of nodes

Schmidt, Drew; Chen, Wei-Chen; Matheson, Michael A.; Ostrouchov, George

doi:10.1016/j.bdr.2016.10.002

Journal Article · November 9, 2016 · Big Data Research

DOI: https://doi.org/10.1016/j.bdr.2016.10.002 · OSTI ID: 1333101

Schmidt, Drew [1]; Chen, Wei-Chen [2]; Matheson, Michael A. [3]; Ostrouchov, George [4]

[1] Univ. of Tennessee, Knoxville, TN (United States)
[2] U.S. Food and Drug Administration, Silver Spring, MD (United States)
[3] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[4] Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

We present a tutorial overview showing how one can achieve scalable performance with R. We do so by using several package extensions, including those from the pbdR project. These packages consist of high-performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool for understanding the performance of R code and for guiding both serial and parallel improvements.

Research Organization: Joint Institute for Computational Sciences (JICS); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)

Sponsoring Organization: ME USDOE - Office of Management, Budget, and Evaluation; ORNL work for others; USDOE; USDOE Office of Science (SC)

Grant/Contract Number: AC05-00OR22725

OSTI ID: 1333101

Alternate ID(s): OSTI ID: 1416808

Journal Information: Big Data Research; ISSN 2214-5796

Publisher: Elsevier

Country of Publication: United States

Language: English

Related Subjects

97 MATHEMATICS AND COMPUTING