U.S. Department of Energy
Office of Scientific and Technical Information

Machine Learning for Big Data: A Study to Understand Limits at Scale

Technical Report · DOI: https://doi.org/10.2172/1234336 · OSTI ID: 1234336
  1. Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
This report aims to empirically understand the limits of machine learning when applied to Big Data. We observe that recent innovations in collecting, accessing, organizing, integrating, and querying massive amounts of data from a wide variety of sources have brought statistical data mining and machine learning under more scrutiny, evaluation, and application for gleaning insights from data than ever before. Much is expected from algorithms without an understanding of their limitations at scale when dealing with massive datasets. In that context, we pose and address the following questions: How does a machine learning algorithm perform on measures such as accuracy and execution time with increasing sample size and feature dimensionality? Does training with more samples guarantee better accuracy? How many features should be computed for a given problem? Do more features guarantee better accuracy? Are the efforts to derive and compute more features and to train on larger samples worthwhile? As problems become more complex and traditional binary classification algorithms are replaced with multi-task, multi-class categorization algorithms, do parallel learners perform better? What happens to the accuracy of the learning algorithm when it is trained to categorize multiple classes within the same feature space? Toward finding answers to these questions, we describe the design of an empirical study and present the results.
We conclude with the following observations: (i) the accuracy of the learning algorithm increases with increasing sample size but saturates at a point, beyond which more samples do not contribute to better accuracy/learning; (ii) the richness of the feature space dictates performance, both accuracy and training time; (iii) increased dimensionality was often reflected in better performance (higher accuracy in spite of longer training times), but the improvements are not commensurate with the effort of feature computation and training; (iv) the accuracy of the learning algorithms drops significantly when multi-class learners are trained on the same feature matrix; and (v) learning algorithms perform well when the categories in the labeled data are independent (i.e., no relationship or hierarchy exists among the categories).
Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
AC05-00OR22725
OSTI ID:
1234336
Report Number(s):
ORNL/TM--2015/344
Country of Publication:
United States
Language:
English

Similar Records

A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks
Journal Article · 2022 · Statistics and Computing · OSTI ID:2469624

Few measurement shots challenge generalization in learning to classify entanglement
Journal Article · 2024 · No journal information · OSTI ID:2504207

Scaling up: Distributed machine learning with cooperation
Conference · 1996 · OSTI ID:430637