OSTI.GOV
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.

Abstract

The purpose of this project was to develop sampling-based algorithms to discover hidden structure in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on are streaming algorithms, so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage, much less than the size of the full data stream, so they must necessarily use sampling to approximate the correct answer. We present two results:

  • A streaming algorithm called HyperHeadTail that estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property, but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world applications. We improve upon prior work by developing an algorithm that can handle streams with repeated edges, and graph structures that evolve over time.
  • An algorithm for the task of maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight, and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.
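The report itself contains the full specification of HyperHeadTail; the abstract only sketches the idea. As a rough illustration of the general approach, the Python sketch below samples vertices via a deterministic hash as they appear in an edge stream, tracks the distinct neighbors of each sampled vertex (so repeated edges are not double-counted), and rescales the sampled histogram by the inverse sampling probability. The class name, parameters, and sampling scheme are illustrative assumptions, not the published algorithm.

    import hashlib
    from collections import defaultdict

    class DegreeDistributionSketch:
        """Toy streaming estimator of a graph's degree distribution.

        Illustrative stand-in only, not HyperHeadTail: each vertex is
        kept with probability p (decided by hashing, so no reject list
        is stored), distinct neighbors of kept vertices are tracked so
        repeated edges are not double-counted, and the sampled
        histogram is rescaled by 1/p.
        """

        def __init__(self, p=0.01, salt="demo"):
            self.p = p            # vertex sampling probability
            self.salt = salt      # makes the hash-based coin flips reproducible
            self.sampled = {}     # vertex -> set of distinct neighbors seen

        def _in_sample(self, v):
            # Deterministic "coin flip": hash the vertex id to a
            # number in [0, 1) and compare against p.
            h = hashlib.sha1(f"{self.salt}:{v}".encode()).digest()
            return int.from_bytes(h[:8], "big") / 2**64 < self.p

        def add_edge(self, u, v):
            """Process one (possibly repeated) undirected edge."""
            for a, b in ((u, v), (v, u)):
                if self._in_sample(a):
                    self.sampled.setdefault(a, set()).add(b)

        def estimate(self):
            """Estimated number of vertices at each degree."""
            hist = defaultdict(float)
            for neighbors in self.sampled.values():
                hist[len(neighbors)] += 1.0 / self.p
            return dict(hist)

A stream of (u, v) pairs is fed through add_edge, and estimate() returns the rescaled degree histogram. Graphs that evolve over time, the report's other extension, would additionally require aging out stale edges, which this sketch omits.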
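The dynamic-weight subsampling algorithm is likewise only summarized in the abstract. For context, the standard weighted reservoir scheme for static weights, Efraimidis and Spirakis's A-Res, is sketched below; supporting weight changes after an item has been seen is precisely what this classic scheme lacks and what the report's algorithm adds. The function name and signature here are my own.

    import heapq
    import random

    def weighted_reservoir(stream, k, rng=random):
        """Sample k items from a stream with probability proportional
        to weight (Efraimidis-Spirakis A-Res, static weights only).

        Each (item, weight) pair gets the key u**(1/weight), with u
        uniform in (0, 1); the k items with the largest keys form the
        sample. Weight updates after insertion are not supported,
        which is the limitation the report's algorithm addresses.
        """
        heap = []  # min-heap of (key, arrival_index, item)
        for i, (item, weight) in enumerate(stream):
            if weight <= 0:
                continue  # non-positive weights can never be sampled
            key = rng.random() ** (1.0 / weight)
            entry = (key, i, item)  # index breaks ties between equal keys
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif key > heap[0][0]:
                heapq.heapreplace(heap, entry)  # evict the smallest key
        return [item for _, _, item in heap]

    # Example: heavier items appear in the sample more often.
    sample = weighted_reservoir([("a", 1.0), ("b", 10.0), ("c", 0.5)], k=2)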

Authors:
Matulef, Kevin Michael [1]
  1. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
February 1, 2017
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1367498
Report Number(s):
SAND-2017-1475
654224
DOE Contract Number:
AC04-94AL85000
Resource Type:
Technical Report
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Matulef, Kevin Michael. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data. United States: N. p., 2017. Web. doi:10.2172/1367498.
Matulef, Kevin Michael. Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data. United States. doi:10.2172/1367498.
Matulef, Kevin Michael. 2017. "Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data". United States. doi:10.2172/1367498. https://www.osti.gov/servlets/purl/1367498.
@article{osti_1367498,
title = {Final Report: Sampling-Based Algorithms for Estimating Structure in Big Data.},
author = {Matulef, Kevin Michael},
abstractNote = {The purpose of this project was to develop sampling-based algorithms to discover hidden structure in massive data sets. Inferring structure in large data sets is an increasingly common task in many critical national security applications. These data sets come from myriad sources, such as network traffic, sensor data, and data generated by large-scale simulations. They are often so large that traditional data mining techniques are time consuming or even infeasible. To address this problem, we focus on a class of algorithms that do not compute an exact answer, but instead use sampling to compute an approximate answer using fewer resources. The particular class of algorithms that we focus on are streaming algorithms, so called because they are designed to handle high-throughput streams of data. Streaming algorithms have only a small amount of working storage, much less than the size of the full data stream, so they must necessarily use sampling to approximate the correct answer. We present two results: * A streaming algorithm called HyperHeadTail that estimates the degree distribution of a graph (i.e., the distribution of the number of connections for each node in a network). The degree distribution is a fundamental graph property, but prior work on estimating the degree distribution in a streaming setting was impractical for many real-world applications. We improve upon prior work by developing an algorithm that can handle streams with repeated edges, and graph structures that evolve over time. * An algorithm for the task of maintaining a weighted subsample of items in a stream, when the items must be sampled according to their weight, and the weights are dynamically changing. To our knowledge, this is the first such algorithm designed for dynamically evolving weights. We expect it may be useful as a building block for other streaming algorithms on dynamic data sets.},
doi = {10.2172/1367498},
place = {United States},
year = {2017},
month = {feb}
}

Similar Records
  • The common theme of this dissertation is sampling-based motion planning, with the two key contributions being in the areas of replanning and spatial load balancing for robotic systems. Here, we begin by recalling two sampling-based motion planners: the asymptotically optimal rapidly-exploring random tree (RRT*), and the asymptotically optimal probabilistic roadmap (PRM*). We also provide a brief background on collision cones and the Distributed Reactive Collision Avoidance (DRCA) algorithm. The next four chapters detail novel contributions for motion replanning in environments with unexpected static obstacles, for multi-agent collision avoidance, and for spatial load balancing. First, we show improved performance of the RRT* when using the proposed Grandparent-Connection (GP) or Focused-Refinement (FR) algorithms. Next, the Goal Tree algorithm for replanning with unexpected static obstacles is detailed and proven to be asymptotically optimal. A multi-agent collision avoidance problem in obstacle environments is approached via the RRT*, leading to the novel Sampling-Based Collision Avoidance (SBCA) algorithm. The SBCA algorithm is proven to guarantee collision-free trajectories for all of the agents, even when subject to uncertainties in the knowledge of the other agents’ positions and velocities. Given that a solution exists, we prove that livelocks and deadlocks will lead to the cost to the goal being decreased. We introduce a new deconfliction maneuver that decreases the cost-to-come at each step. This new maneuver removes the possibility of livelocks and allows a result to be formed that proves convergence to the goal configurations. Finally, we present a limited range Graph-based Spatial Load Balancing (GSLB) algorithm which fairly divides a non-convex space among multiple agents that are subject to differential constraints and have a limited travel distance. The GSLB is proven to converge to a solution when maximizing the area covered by the agents. The analysis for each of the above mentioned algorithms is confirmed in simulations.
  • Quantum mechanical ab initio calculation constitutes the biggest portion of the computer time in material science and chemical science simulations. For a computer center like NERSC to better serve these communities, it is very useful to have a prediction of the future trends of ab initio calculations in these areas. Such a prediction can help us decide what future computer architecture will be most useful for these communities, and what should be emphasized in future supercomputer procurements. As the size of the computer and the size of the simulated physical systems increase, there is a renewed interest in using the real space grid method in electronic structure calculations. This is fueled by two factors. First, it is generally assumed that the real space grid method is more suitable for parallel computation because of its limited communication requirement, compared with the spectrum method, where a global FFT is required. Second, as the size N of the calculated system increases together with the computer power, O(N) scaling approaches become more favorable than the traditional direct O(N³) scaling methods. These O(N) methods are usually based on localized orbitals in real space, which can be described more naturally by a real space basis. In this report, the author compares the real space methods with the traditional plane wave (PW) spectrum methods, covering their technical pros and cons and possible future trends. For the real space method, the author focuses on the regular grid finite difference (FD) method and the finite element (FE) method. These are the methods used mostly in material science simulation. As for chemical science, the predominant method is still the Gaussian basis method, and sometimes the atomic orbital basis method. These two basis sets are localized in real space, and there is no indication that their roles in quantum chemical simulation will change anytime soon. The author focuses on the density functional theory (DFT), which is the most used method for quantum mechanical material science simulation.
  • A probability sampling method called SALT (Selection At List Time) has been developed for collecting and summarizing data on the delivery of suspended sediment in rivers. It is based on sampling and estimating yield using a suspended-sediment rating curve for high discharges and simple random sampling for low flows. The method gives unbiased estimates of total yield and variance. The technique has been modified by replacing the rating curve with a user-specified average sampling rate function. This function allows easier specification of field sampling parameters for specified conditions and helps avoid the extremes of data collection. It also improves the distribution of samples if the intent is to estimate suspended sediment yield during storms specified after data collection. This form of SALT sampling is called piecewise SALT sampling.
  • Nuclear reactors in the United States account for roughly 20% of the nation's total electric energy generation, and maintaining their safety with regard to the structural integrity of key components is critical not only for the long-term use of such plants but also for the safety of personnel and the public living around the plant. Early detection of damage signatures, such as stress corrosion cracking and thermal-mechanical loading related material degradation in safety-critical components, is a necessary requirement for the long-term and safe operation of nuclear power plant systems.
  • The report describes the site-specific changes required to convert an existing lime FGD system to a limestone system enhanced by dibasic acid (DBA) or adipic acid, and the costs of making such a change. In 1982-83, pilot plant tests were conducted at the R. D. Green Station of Big Rivers Electric Corporation (BREC). The final report of the pilot testing included comparisons of the operating costs of a lime-based full-size absorber to those of a retrofit limestone system enhanced with DBA or adipic acid. Results of this analysis indicate that an annual cost savings of $2.6 million could be achieved by converting the existing BREC lime system to an adipic-acid-enhanced limestone system, and an annual savings of $3.1 million could be achieved by converting to a DBA-enhanced system.