Title: Scalable Visual Analytics of Massive Textual Datasets

Abstract

This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools help information analysts interact with and understand large volumes of textual information through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte datasets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.
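
The abstract does not spell out the parallelization approach, so the following is only a minimal sketch of one common data-parallel pattern for text processing, not the authors' implementation. It assumes a static split of the document collection, a local term count per partition, and a final merge. Worker threads stand in for cluster nodes, and the names `count_terms`, `partition`, and `parallel_term_counts` are hypothetical.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tokenizer: lowercase alphanumeric runs only.
TOKEN = re.compile(r"[a-z0-9]+")


def count_terms(docs):
    """Count term frequencies in one partition of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(TOKEN.findall(doc.lower()))
    return counts


def partition(docs, n_parts):
    """Split the collection into n_parts round-robin slices (assumed scheme)."""
    return [docs[i::n_parts] for i in range(n_parts)]


def parallel_term_counts(docs, n_workers=4):
    """Count terms per partition concurrently, then merge the partial counts."""
    parts = partition(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_terms, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

Because each partition is counted independently and the merge is a simple sum, the result does not depend on the number of workers. On a real cluster the merge step and I/O would likely dominate the design, which this sketch does not model.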

Authors:
Krishnan, Manoj Kumar; Bohn, Shawn J.; Cowley, Wendy E.; Crow, Vernon L.; Nieplocha, Jarek
Publication Date:
2007-04-01
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
908953
Report Number(s):
PNNL-SA-52302
TRN: US200722%%830
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: IPDPS 2007. IEEE International Parallel and Distributed Processing Symposium, 26-30 March 2007, Long Beach, CA, USA, 10 pages
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; INFORMATION SYSTEMS; DATA PROCESSING; DOCUMENT TYPES; PARALLEL PROCESSING; Visual Analytics; parallel processing

Citation Formats

Krishnan, Manoj Kumar, Bohn, Shawn J., Cowley, Wendy E., Crow, Vernon L., and Nieplocha, Jarek. Scalable Visual Analytics of Massive Textual Datasets. United States: N. p., 2007. Web. doi:10.1109/IPDPS.2007.370232.
Krishnan, Manoj Kumar, Bohn, Shawn J., Cowley, Wendy E., Crow, Vernon L., & Nieplocha, Jarek. Scalable Visual Analytics of Massive Textual Datasets. United States. doi:10.1109/IPDPS.2007.370232.
Krishnan, Manoj Kumar, Bohn, Shawn J., Cowley, Wendy E., Crow, Vernon L., and Nieplocha, Jarek. 2007. "Scalable Visual Analytics of Massive Textual Datasets". United States. doi:10.1109/IPDPS.2007.370232.
@article{osti_908953,
title = {Scalable Visual Analytics of Massive Textual Datasets},
author = {Krishnan, Manoj Kumar and Bohn, Shawn J. and Cowley, Wendy E. and Crow, Vernon L. and Nieplocha, Jarek},
abstractNote = {This paper describes the first scalable implementation of a text processing engine used in visual analytics tools. These tools help information analysts interact with and understand large volumes of textual information through visual interfaces. By developing a parallel implementation of the text processing engine, we enabled visual analytics tools to exploit cluster architectures and handle massive datasets. The paper describes key elements of our parallelization approach and demonstrates virtually linear scaling when processing multi-gigabyte datasets such as PubMed. This approach enables interactive analysis of large datasets beyond the capabilities of existing state-of-the-art visual analytics tools.},
doi = {10.1109/IPDPS.2007.370232},
place = {United States},
year = {2007},
month = apr
}

  • In this paper, we present an overview of the big data challenges in disease bio-surveillance and then discuss the use of visual analytics for integrating data and turning it into knowledge. We will explore two integration scenarios: (1) combining text and multimedia sources to improve situational awareness and (2) enhancing disease spread model data with real-time bio-surveillance data. Together, the proposed integration methodologies can improve awareness about when, where and how emerging diseases can affect wide geographic regions.
  • Visualization tools have been invaluable in the process of scientific discovery by providing researchers with insights gained through graphical tools and techniques. At PNL, the Multidimensional Visualization and Advanced Browsing (MVAB) project is extending visualization technology to the problems of intelligence analysis of textual documents by creating spatial representations of textual information. By representing an entire corpus of documents as points in a coordinate space of two or more dimensions, the tools developed by the MVAB team give the analyst the ability to quickly browse the entire document base and determine relationships among documents and publication patterns not readily discernible through traditional lexical means.
  • Understanding vector fields resulting from large scientific simulations is an important and often difficult task. Streamlines, curves that are tangential to a vector field at each point, are a powerful visualization method in this context. Application of streamline-based visualization to very large vector field data represents a significant challenge due to the non-local and data-dependent nature of streamline computation, and requires careful balancing of computational demands placed on I/O, memory, communication, and processors. In this paper we review two parallelization approaches based on established parallelization paradigms (static decomposition and on-demand loading) and present a novel hybrid algorithm for computing streamlines. Our algorithm is aimed at good scalability and performance across the widely varying computational characteristics of streamline-based problems. We perform performance and scalability studies of all three algorithms on a number of prototypical application problems and demonstrate that our hybrid scheme is able to perform well in different settings.
  • Understanding vector fields resulting from large scientific simulations is an important and often difficult task. Streamlines, curves that are tangential to a vector field at each point, are a powerful visualization method in this context. Application of streamline-based visualization to very large vector field data represents a significant challenge due to the non-local and data-dependent nature of streamline computation, and requires careful balancing of computational demands placed on I/O, memory, communication, and processors. In this paper we review two parallelization approaches based on established parallelization paradigms (static decomposition and on-demand loading) and present a novel hybrid algorithm for computing streamlines. Our algorithm is aimed at good scalability and performance across the widely varying computational characteristics of streamline-based problems. We perform performance and scalability studies of all three algorithms on a number of prototypical application problems and demonstrate that our hybrid scheme is able to perform well in different settings.