skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.

Abstract

The purpose of this project was to develop sampling-based algorithms to discover hidden struc- ture in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on are streaming algorithms , so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage - much less than the size of the full data stream - so they must necessarily use sampling to approximate the correct answer. We present two results: * A streaming algorithm called HyperHeadTail , that estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property,more » but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world application. We improve upon prior work by developing an algorithm that can handle streams with repeated edges, and graph structures that evolve over time. * An algorithm for the task of maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight, and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.« less

Authors:
 [1]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1367498
Report Number(s):
SAND-2017-1475
654224
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Matulef, Kevin Michael. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.. United States: N. p., 2017. Web. doi:10.2172/1367498.
Matulef, Kevin Michael. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.. United States. doi:10.2172/1367498.
Matulef, Kevin Michael. Wed . "Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.". United States. doi:10.2172/1367498. https://www.osti.gov/servlets/purl/1367498.
@article{osti_1367498,
title = {Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.},
author = {Matulef, Kevin Michael},
abstractNote = {The purpose of this project was to develop sampling-based algorithms to discover hidden struc- ture in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on are streaming algorithms , so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage - much less than the size of the full data stream - so they must necessarily use sampling to approximate the correct answer. We present two results: * A streaming algorithm called HyperHeadTail , that estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property, but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world application. We improve upon prior work by developing an algorithm that can handle streams with repeated edges, and graph structures that evolve over time. * An algorithm for the task of maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight, and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.},
doi = {10.2172/1367498},
journal = {},
number = ,
volume = ,
place = {United States},
year = {Wed Feb 01 00:00:00 EST 2017},
month = {Wed Feb 01 00:00:00 EST 2017}
}

Technical Report:

Save / Share: