skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Machine Learning for Massive Scale Cosmology. Final Report


This report summarizes work done on applying machine learning algorithms in the domain of cosmology.

  1. Carnegie Mellon Univ., Pittsburgh, PA (United States)
Publication Date:
Research Org.:
Carnegie Mellon Univ., Pittsburgh, PA (United States)
Sponsoring Org.:
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
OSTI Identifier:
Report Number(s):
DOE Contract Number:
Resource Type:
Technical Report
Country of Publication:
United States
79 ASTRONOMY AND ASTROPHYSICS; Machine Learning Cosmology

Citation Formats

Schneider, Jeff. Machine Learning for Massive Scale Cosmology. Final Report. United States: N. p., 2017. Web. doi:10.2172/1355867.
Schneider, Jeff. Machine Learning for Massive Scale Cosmology. Final Report. United States. doi:10.2172/1355867.
Schneider, Jeff. 2017. "Machine Learning for Massive Scale Cosmology. Final Report". United States. doi:10.2172/1355867.
title = {Machine Learning for Massive Scale Cosmology. Final Report},
author = {Schneider, Jeff},
abstractNote = {This report summarizes work done on applying machine learning algorithms in the domain of cosmology.},
doi = {10.2172/1355867},
journal = {},
number = ,
volume = ,
place = {United States},
year = 2017,
month = 5

Technical Report:

Save / Share:
  • This report aims to empirically understand the limits of machine learning when applied to Big Data. We observe that recent innovations in being able to collect, access, organize, integrate, and query massive amounts of data from a wide variety of data sources have brought statistical data mining and machine learning under more scrutiny, evaluation and application for gleaning insights from the data than ever before. Much is expected from algorithms without understanding their limitations at scale while dealing with massive datasets. In that context, we pose and address the following questions How does a machine learning algorithm perform on measuresmore » such as accuracy and execution time with increasing sample size and feature dimensionality? Does training with more samples guarantee better accuracy? How many features to compute for a given problem? Do more features guarantee better accuracy? Do efforts to derive and calculate more features and train on larger samples worth the effort? As problems become more complex and traditional binary classification algorithms are replaced with multi-task, multi-class categorization algorithms do parallel learners perform better? What happens to the accuracy of the learning algorithm when trained to categorize multiple classes within the same feature space? Towards finding answers to these questions, we describe the design of an empirical study and present the results. We conclude with the following observations (i) accuracy of the learning algorithm increases with increasing sample size but saturates at a point, beyond which more samples do not contribute to better accuracy/learning, (ii) the richness of the feature space dictates performance - both accuracy and training time, (iii) increased dimensionality often reflected in better performance (higher accuracy in spite of longer training times) but the improvements are not commensurate the efforts for feature computation and training and (iv) accuracy of the learning algorithms drop significantly with multi-class learners training on the same feature matrix and (v) learning algorithms perform well when categories in labeled data are independent (i.e., no relationship or hierarchy exists among categories).« less
  • This white paper introduces the application of advanced data analytics to the modernized grid. In particular, we consider the field of machine learning and where it is both useful, and not useful, for the particular field of the distribution grid and buildings interface. While analytics, in general, is a growing field of interest, and often seen as the golden goose in the burgeoning distribution grid industry, its application is often limited by communications infrastructure, or lack of a focused technical application. Overall, the linkage of analytics to purposeful application in the grid space has been limited. In this paper wemore » consider the field of machine learning as a subset of analytical techniques, and discuss its ability and limitations to enable the future distribution grid and the building-to-grid interface. To that end, we also consider the potential for mixing distributed and centralized analytics and the pros and cons of these approaches. Machine learning is a subfield of computer science that studies and constructs algorithms that can learn from data and make predictions and improve forecasts. Incorporation of machine learning in grid monitoring and analysis tools may have the potential to solve data and operational challenges that result from increasing penetration of distributed and behind-the-meter energy resources. There is an exponentially expanding volume of measured data being generated on the distribution grid, which, with appropriate application of analytics, may be transformed into intelligible, actionable information that can be provided to the right actors – such as grid and building operators, at the appropriate time to enhance grid or building resilience, efficiency, and operations against various metrics or goals – such as total carbon reduction or other economic benefit to customers. While some basic analysis into these data streams can provide a wealth of information, computational and human boundaries on performing the analysis are becoming significant, with more data and multi-objective concerns. Efficient applications of analysis and the machine learning field are being considered in the loop.« less
  • The goal of the project was the development and demonstration of a significantly improved solar forecasting technology (short: Watt-sun), which leverages new big data processing technologies and machine-learnt blending between different models and forecast systems. The technology aimed demonstrating major advances in accuracy as measured by existing and new metrics which themselves were developed as part of this project. Finally, the team worked with Independent System Operators (ISOs) and utilities to integrate the forecasts into their operations.
  • Artificial-intelligence approaches to learning were reviewed for their potential contributions to the construction of a system to learn parameter-control doctrine. Separate learning tasks were isolated and several levels of related problems were distinguished. Formulas for providing the learning system with measures of its performance were derived for four kinds of targets.