skip to main content

Title: Machine Learning for Big Data: A Study to Understand Limits at Scale

This report aims to empirically understand the limits of machine learning when applied to Big Data. We observe that recent innovations in being able to collect, access, organize, integrate, and query massive amounts of data from a wide variety of data sources have brought statistical data mining and machine learning under more scrutiny, evaluation and application for gleaning insights from the data than ever before. Much is expected from algorithms without understanding their limitations at scale while dealing with massive datasets. In that context, we pose and address the following questions How does a machine learning algorithm perform on measures such as accuracy and execution time with increasing sample size and feature dimensionality? Does training with more samples guarantee better accuracy? How many features to compute for a given problem? Do more features guarantee better accuracy? Do efforts to derive and calculate more features and train on larger samples worth the effort? As problems become more complex and traditional binary classification algorithms are replaced with multi-task, multi-class categorization algorithms do parallel learners perform better? What happens to the accuracy of the learning algorithm when trained to categorize multiple classes within the same feature space? Towards finding answers to these questions,more » we describe the design of an empirical study and present the results. We conclude with the following observations (i) accuracy of the learning algorithm increases with increasing sample size but saturates at a point, beyond which more samples do not contribute to better accuracy/learning, (ii) the richness of the feature space dictates performance - both accuracy and training time, (iii) increased dimensionality often reflected in better performance (higher accuracy in spite of longer training times) but the improvements are not commensurate the efforts for feature computation and training and (iv) accuracy of the learning algorithms drop significantly with multi-class learners training on the same feature matrix and (v) learning algorithms perform well when categories in labeled data are independent (i.e., no relationship or hierarchy exists among categories).« less
 [1] ;  [1]
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Publication Date:
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Technical Report
Research Org:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Org:
Country of Publication:
United States
96 KNOWLEDGE MANAGEMENT AND PRESERVATION; 97 MATHEMATICS AND COMPUTING machine learning; Big Data; Scalability; scale; limits of machine learning