OSTI.GOV — U.S. Department of Energy
Office of Scientific and Technical Information

Title: Discovering Event Evidence Amid Massive, Dynamic Datasets

Abstract

Automated event extraction remains a difficult challenge, one that currently requires information analysts to manually identify key events of interest within massive, dynamic data. Many techniques for extracting events rely on domain-specific natural language processing or information retrieval techniques. As an alternative, this work focuses on detecting events by identifying event characteristics of interest to an analyst. An evolutionary algorithm is developed as a proof of concept to demonstrate this approach. Initial results indicate that this is a feasible approach to identifying critical event information in a massive data set with no a priori knowledge of the data set.
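The record does not include source code. As a minimal sketch of the kind of approach the abstract describes, the following evolves term weightings against an analyst-specified event profile and then ranks documents by match score. The document collection, the EVENT_TERMS profile, and the fitness definition are all illustrative assumptions, not details from the paper.

import random

# Illustrative document collection and analyst-specified event
# characteristics (both hypothetical; the paper targets massive,
# dynamic data sets with no a priori knowledge of their content).
DOCUMENTS = [
    "earthquake strikes coastal city, thousands evacuated",
    "quarterly earnings report shows modest growth",
    "flooding follows earthquake as aid agencies respond",
]
EVENT_TERMS = ["earthquake", "evacuated", "flooding", "aid"]

def fitness(weights):
    """Score a candidate term weighting by the strongest match it
    produces against any document in the collection."""
    best = 0.0
    for doc in DOCUMENTS:
        score = sum(w for w, t in zip(weights, EVENT_TERMS) if t in doc)
        best = max(best, score)
    return best

def evolve(pop_size=20, generations=50, mutation_rate=0.2):
    """A simple generational GA over term-weight vectors."""
    population = [[random.random() for _ in EVENT_TERMS]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]   # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(EVENT_TERMS))
            child = a[:cut] + b[cut:]             # one-point crossover
            if random.random() < mutation_rate:   # point mutation
                child[random.randrange(len(child))] = random.random()
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

best = evolve()
# Rank documents by how well they match the evolved event profile.
ranked = sorted(DOCUMENTS, key=lambda d: sum(
    w for w, t in zip(best, EVENT_TERMS) if t in d), reverse=True)
print("most event-like document:", ranked[0])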

Authors:
Patton, Robert M [1];  Potok, Thomas E [1]
  1. ORNL
Publication Date:
2007
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
Work for Others (WFO)
OSTI Identifier:
931802
DOE Contract Number:
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Genetic and Evolutionary Computation Conference, London, United Kingdom, July 7-11, 2007
Country of Publication:
United States
Language:
English

Citation Formats

Patton, Robert M, and Potok, Thomas E. Discovering Event Evidence Amid Massive, Dynamic Datasets. United States: N. p., 2007. Web. doi:10.1145/1274000.1274033.
Patton, Robert M, & Potok, Thomas E. Discovering Event Evidence Amid Massive, Dynamic Datasets. United States. doi:10.1145/1274000.1274033.
Patton, Robert M, and Potok, Thomas E. 2007. "Discovering Event Evidence Amid Massive, Dynamic Datasets". United States. doi:10.1145/1274000.1274033.
@inproceedings{osti_931802,
  title = {Discovering Event Evidence Amid Massive, Dynamic Datasets},
  author = {Patton, Robert M and Potok, Thomas E},
  abstractNote = {Automated event extraction remains a difficult challenge, one that currently requires information analysts to manually identify key events of interest within massive, dynamic data. Many techniques for extracting events rely on domain-specific natural language processing or information retrieval techniques. As an alternative, this work focuses on detecting events by identifying event characteristics of interest to an analyst. An evolutionary algorithm is developed as a proof of concept to demonstrate this approach. Initial results indicate that this is a feasible approach to identifying critical event information in a massive data set with no a priori knowledge of the data set.},
  doi = {10.1145/1274000.1274033},
  booktitle = {Genetic and Evolutionary Computation Conference, London, United Kingdom},
  place = {United States},
  year = {2007}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools help information analysts interact with and understand large textual information content through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes the key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte data sets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
  • We present a new lock-free parallel algorithm for computing the betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality than previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been used extensively to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, corresponding to more than a 2X performance improvement over previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to the analysis of massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network. (A serial sketch of the betweenness centrality kernel appears after this list.)
  • The size of the datasets produced by current climate models is increasing rapidly, to the scale of petabytes. Handling data at this scale requires parallel analysis tools, yet the majority of climate analysis software remains at the scale of workstations. Further, many climate analysis tools adequately process regularly gridded data but lack sufficient features for handling unstructured grids. This paper presents a data-parallel subsetter capable of correctly handling unstructured grids while scaling to over 2,000 cores. The approach is based on the partitioned global address space (PGAS) parallel programming model and one-sided communication. The paper demonstrates that I/O remains the single greatest bottleneck for this domain of applications and that parallel analysis of climate data succeeds in practice.
  • We illustrate the use of a computational framework for applying non-parametric statistical methods to the comparison of massive spatiotemporal datasets within a distributed computing environment.
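As a rough illustration of the betweenness centrality kernel named in the second record above, here is a minimal serial sketch based on Brandes' algorithm. The record's contribution is a lock-free parallel formulation on multithreaded hardware; this sequential Python version, with an illustrative toy graph, only shows the underlying computation.

from collections import deque

def betweenness_centrality(graph):
    """Brandes' algorithm for unweighted graphs.

    `graph` maps each vertex to an iterable of its neighbors.
    Returns a dict of (unnormalized) betweenness scores. This is
    the serial textbook kernel, not the lock-free parallel
    formulation described in the record above.
    """
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        # BFS from s, counting shortest paths (sigma) and
        # recording predecessors along those paths.
        stack, pred = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)
        dist = dict.fromkeys(graph, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:                # w visited for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:     # shortest path to w via v
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Back-propagate pair dependencies in order of decreasing distance.
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Tiny example: a path graph a-b-c; b lies on the only a-c shortest path.
g = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness_centrality(g))  # b has the only nonzero score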