DOE PAGES, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Programming with BIG data in R: Scaling analytics from one to thousands of nodes

Abstract

Here, we present a tutorial overview showing how one can achieve scalable performance with R. We do so by utilizing several package extensions, including those from the pbdR project. These packages consist of high performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool to understand the performance of an R code for both serial and parallel improvements.

Authors:
 Schmidt, Drew [1]; Chen, Wei-Chen [2]; Matheson, Michael A. [3]; Ostrouchov, George [4]
  1. Univ. of Tennessee, Knoxville, TN (United States)
  2. U.S. Food and Drug Administration, Silver Spring, MD (United States)
  3. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
  4. Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
2016-11-09
Research Org.:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Joint Institute for Computational Sciences (JICS)
Sponsoring Org.:
Work for Others (WFO); USDOE Office of Science (SC)
OSTI Identifier:
1333101
Alternate Identifier(s):
OSTI ID: 1416808
Grant/Contract Number:  
AC05-00OR22725
Resource Type:
Accepted Manuscript
Journal Name:
Big Data Research
Additional Journal Information:
Journal Name: Big Data Research; Journal ID: ISSN 2214-5796
Publisher:
Elsevier
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Schmidt, Drew, Chen, Wei-Chen, Matheson, Michael A., and Ostrouchov, George. Programming with BIG data in R: Scaling analytics from one to thousands of nodes. United States: N. p., 2016. Web. doi:10.1016/j.bdr.2016.10.002.
Schmidt, Drew, Chen, Wei-Chen, Matheson, Michael A., & Ostrouchov, George. Programming with BIG data in R: Scaling analytics from one to thousands of nodes. United States. https://doi.org/10.1016/j.bdr.2016.10.002
Schmidt, Drew, Chen, Wei-Chen, Matheson, Michael A., and Ostrouchov, George. 2016. "Programming with BIG data in R: Scaling analytics from one to thousands of nodes". United States. https://doi.org/10.1016/j.bdr.2016.10.002. https://www.osti.gov/servlets/purl/1333101.
@article{osti_1333101,
title = {Programming with BIG data in R: Scaling analytics from one to thousands of nodes},
author = {Schmidt, Drew and Chen, Wei-Chen and Matheson, Michael A. and Ostrouchov, George},
abstractNote = {Here, we present a tutorial overview showing how one can achieve scalable performance with R. We do so by utilizing several package extensions, including those from the pbdR project. These packages consist of high performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool to understand the performance of an R code for both serial and parallel improvements.},
doi = {10.1016/j.bdr.2016.10.002},
journal = {Big Data Research},
place = {United States},
year = {2016},
month = {11}
}

Citation Metrics:
Cited by: 11 works
Citation information provided by
Web of Science
