Programming with BIG data in R: Scaling analytics from one to thousands of nodes

Schmidt, Drew; Chen, Wei-Chen; Matheson, Michael A.; Ostrouchov, George

doi:10.1016/j.bdr.2016.10.002

Abstract

Here, we present a tutorial overview showing how one can achieve scalable performance with R. We do so by utilizing several package extensions, including those from the pbdR project. These packages consist of high performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool to understand the performance of an R code for both serial and parallel improvements.

Authors:

Schmidt, Drew ^[1]; Chen, Wei-Chen ^[2]; Matheson, Michael A. ^[3]; Ostrouchov, George ^[4]

[1] Univ. of Tennessee, Knoxville, TN (United States)
[2] U.S. Food and Drug Administration, Silver Spring, MD (United States)
[3] Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
[4] Univ. of Tennessee, Knoxville, TN (United States); Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)

Publication Date: 2016-11-09

Research Org.: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Joint Institute for Computational Sciences (JICS)

Sponsoring Org.: Work for Others (WFO); USDOE Office of Science (SC)

OSTI Identifier: 1333101

Alternate Identifier(s): OSTI ID: 1416808

Grant/Contract Number: AC05-00OR22725

Resource Type: Accepted Manuscript

Journal Name: Big Data Research

Additional Journal Information: Journal Name: Big Data Research; Journal ID: ISSN 2214-5796

Publisher: Elsevier

Country of Publication: United States

Language: English

Subject: 97 MATHEMATICS AND COMPUTING

Citation Formats


Schmidt, Drew, Chen, Wei-Chen, Matheson, Michael A., and Ostrouchov, George. Programming with BIG data in R: Scaling analytics from one to thousands of nodes. United States: N. p., 2016. Web. doi:10.1016/j.bdr.2016.10.002.


Schmidt, Drew, Chen, Wei-Chen, Matheson, Michael A., & Ostrouchov, George. Programming with BIG data in R: Scaling analytics from one to thousands of nodes. United States. https://doi.org/10.1016/j.bdr.2016.10.002


Schmidt, Drew, Chen, Wei-Chen, Matheson, Michael A., and Ostrouchov, George. 2016. "Programming with BIG data in R: Scaling analytics from one to thousands of nodes". United States. https://doi.org/10.1016/j.bdr.2016.10.002. https://www.osti.gov/servlets/purl/1333101.


                    
@article{osti_1333101,
  title        = {Programming with BIG data in R: Scaling analytics from one to thousands of nodes},
  author       = {Schmidt, Drew and Chen, Wei-Chen and Matheson, Michael A. and Ostrouchov, George},
  abstractNote = {Here, we present a tutorial overview showing how one can achieve scalable performance with R. We do so by utilizing several package extensions, including those from the pbdR project. These packages consist of high performance, high-level interfaces to and extensions of MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a function can have a deep hierarchy of operations. For big data, this can easily lead to inefficiency. Profiling is an important tool to understand the performance of an R code for both serial and parallel improvements.},
  doi          = {10.1016/j.bdr.2016.10.002},
  journal      = {Big Data Research},
  place        = {United States},
  year         = {2016},
  month        = {11}
}


Publisher's Version of Record

https://doi.org/10.1016/j.bdr.2016.10.002


Citation Metrics: Cited by 11 works (citation information provided by Web of Science)


Works referenced in this record:

RcppArmadillo: Accelerating R with high-performance C++ linear algebra
journal, March 2014

Eddelbuettel, Dirk; Sanderson, Conrad
Computational Statistics & Data Analysis, Vol. 71
DOI: 10.1016/j.csda.2013.02.005

Singular value decomposition and least squares solutions
journal, April 1970

Golub, G. H.; Reinsch, C.
Numerische Mathematik, Vol. 14, Issue 5
DOI: 10.1007/BF02163027

Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
journal, January 2011

Halko, N.; Martinsson, P. G.; Tropp, J. A.
SIAM Review, Vol. 53, Issue 2
DOI: 10.1137/090771806

Basic Linear Algebra Subprograms for Fortran Usage
journal, September 1979

Lawson, C. L.; Hanson, R. J.; Kincaid, D. R.
ACM Transactions on Mathematical Software, Vol. 5, Issue 3
DOI: 10.1145/355841.355847

Methods of Multivariate Analysis
book, January 2002

Rencher, Alvin C.
Wiley Series in Probability and Statistics
DOI: 10.1002/0471271357

Works referencing / citing this record:

Big Data Analytics: A Review on Theoretical Contributions and Tools Used in Literature
journal, June 2017

Grover, Purva; Kar, Arpan Kumar
Global Journal of Flexible Systems Management, Vol. 18, Issue 3
DOI: 10.1007/s40171-017-0159-3

Network Design towards Sustainability of Chinese Baijiu Industry from a Supply Chain Perspective
journal, November 2018

Jiang, Xianglan; Xu, Jiuping; Luo, Jiarong
Discrete Dynamics in Nature and Society, Vol. 2018
DOI: 10.1155/2018/4391351

Situating Ecology as a Big-Data Science: Current Advances, Challenges, and Solutions
journal, July 2018

Farley, Scott S.; Dawson, Andria; Goring, Simon J.
BioScience, Vol. 68, Issue 8
DOI: 10.1093/biosci/biy068

Similar Records in DOE PAGES and OSTI.GOV collections:

PETSc/TAO Users Manual (Rev. 3.19)

Technical Report: Balay, S.; Abhyankar, S.; Adams, Mark F.; ...

This manual describes the use of the Portable, Extensible Toolkit for Scientific Computation (PETSc) and the Toolkit for Advanced Optimization (TAO) for the numerical solution of partial differential equations and related problems on high-performance computers. PETSc/TAO is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all distributed memory communication. PETSc/TAO includes a large suite of parallel linear solvers, nonlinear solvers, time integrators, and optimization that may be used in application codes written in Fortran, C, C++, and ...
https://doi.org/10.2172/1968587

PETSc Users Manual (Rev. 3.3)

Technical Report: Balay, S.; Brown, J.; Buschelman, K.; ...

This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed ...
https://doi.org/10.2172/1178102

PETSc Users Manual (Rev. 3.4)

Technical Report: Balay, S.; Brown, J.; Buschelman, K.; ...

This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed ...
https://doi.org/10.2172/1178104

PETSc Users Manual (Rev. 3.5)

Technical Report: Balay, S.; Abhyankar, S.; Adams, M.; ...

This manual describes the use of PETSc for the numerical solution of partial differential equations and related problems on high-performance computers. The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines that provide the building blocks for the implementation of large-scale application codes on parallel (and serial) computers. PETSc uses the MPI standard for all message-passing communication. PETSc includes an expanding suite of parallel linear, nonlinear equation solvers and time integrators that may be used in application codes written in Fortran, C, C++, Python, and MATLAB (sequential). PETSc provides many of the mechanisms needed ...
https://doi.org/10.2172/1178109

Optimization and Parallelization of the Thermal-Hydraulic Sub-channel Code CTF for High-Fidelity Multi-physics Applications

Journal Article: Salko, Robert K; Schmidt, Rodney; Avramova, Maria N. Annals of Nuclear Energy

This paper describes major improvements to the computational infrastructure of the CTF sub-channel code so that full-core sub-channel-resolved simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations. These improvements support the goals of the Department of Energy (DOE) Consortium for Advanced Simulations of Light Water (CASL) Energy Innovation Hub to develop high fidelity multi-physics simulation tools for nuclear energy design and analysis. A set of serial code optimizations, including fixing computational inefficiencies, optimizing the numerical approach, and making smarter data storage choices, are first described and shown to reduce both execution ...
