# Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.

## Abstract

The purpose of this project was to develop sampling-based algorithms to discover hidden structure in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on are streaming algorithms, so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage (much less than the size of the full data stream), so they must use sampling to approximate the correct answer. We present two results:

- A streaming algorithm called HyperHeadTail that estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property, but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world applications. We improve upon prior work by developing an algorithm that can handle streams with repeated edges and graph structures that evolve over time.
- An algorithm for maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.
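For orientation, the bounded-memory sampling idea mentioned in the abstract can be illustrated with a classical technique: weighted reservoir sampling in the style of Efraimidis and Spirakis (often called A-Res). This is a minimal sketch for fixed weights only and is not the algorithm from the report, which targets weights that change over time. The function name `weighted_reservoir_sample` and its parameters are hypothetical and chosen for illustration.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_reservoir_sample(
    stream: Iterable[Tuple[T, float]],
    k: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Keep at most k items from a stream of (item, weight) pairs.

    Classical A-Res scheme with static weights (an illustrative stand-in,
    not the report's method): each item gets key u ** (1 / weight) with
    u uniform in (0, 1], and the k largest keys are retained.
    Working storage is O(k), independent of the stream length.
    """
    rng = rng or random.Random()
    heap: list = []  # min-heap of (key, arrival_index, item)
    for index, (item, weight) in enumerate(stream):
        if weight <= 0:
            continue  # items with non-positive weight are never sampled
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        key = u ** (1.0 / weight)
        entry = (key, index, item)  # index breaks ties without comparing items
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # New key beats the smallest retained key: swap it in.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


if __name__ == "__main__":
    # Hypothetical stream: node identifiers paired with fixed weights.
    events = [("n%d" % i, 1.0 + (i % 5)) for i in range(10_000)]
    print(weighted_reservoir_sample(events, k=8, rng=random.Random(42)))
```

The min-heap keeps only the `k` best keys seen so far, so memory does not grow with the stream. With `k = 1`, an item is selected with probability proportional to its weight; handling weights that change after an item has been seen requires a different approach, which is the problem the report addresses.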

- Authors:
- Matulef, Kevin Michael, Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

- Publication Date:
- 2017-02-01

- Research Org.:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

- Sponsoring Org.:
- USDOE National Nuclear Security Administration (NNSA)

- OSTI Identifier:
- 1367498

- Report Number(s):
- SAND-2017-1475; 654224

- DOE Contract Number:
- AC04-94AL85000

- Resource Type:
- Technical Report

- Country of Publication:
- United States

- Language:
- English

- Subject:
- 97 MATHEMATICS AND COMPUTING

### Citation Formats

```
Matulef, Kevin Michael. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data. United States: N.p., 2017. Web. doi:10.2172/1367498.
```

```
Matulef, Kevin Michael. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data. United States. doi:10.2172/1367498.
```

```
Matulef, Kevin Michael. 2017. "Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data." United States. doi:10.2172/1367498. https://www.osti.gov/servlets/purl/1367498.
```

```
@article{osti_1367498,
  title = {Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.},
  author = {Matulef, Kevin Michael},
  abstractNote = {The purpose of this project was to develop sampling-based algorithms to discover hidden structure in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on are streaming algorithms, so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage (much less than the size of the full data stream), so they must use sampling to approximate the correct answer. We present two results: (1) A streaming algorithm called HyperHeadTail that estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property, but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world applications. We improve upon prior work by developing an algorithm that can handle streams with repeated edges and graph structures that evolve over time. (2) An algorithm for maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.},
  doi = {10.2172/1367498},
  journal = {},
  number = {},
  volume = {},
  place = {United States},
  year = {2017},
  month = feb
}
```