
Title: Visual Analysis of Text Document Collections

Abstract

The volume of information, communications, and descriptions provided in text form is large and increasing. One of the most often used software applications of our time, web page retrieval based on keyword descriptions, can be constructed as a text analysis application. The volume and diversity of information available in text data sources have driven the development of a variety of methods for interacting with, and presenting the results from, text analyses. In short, text analysis provides a challenging, important area for statistical analysis and application. Existing text analysis systems and technologies are reviewed. Capabilities of the technology are described, including potential for scaling, analytic activities directly supported, analytic activities that could be supported, and unmet analytic needs. Statistics-related technologies that are contained in text visualization systems are identified. Choices and trade-offs made in text visualization systems are indicated, as are some areas of research and potential development.

Authors:
Whitney, Paul D.
Publication Date:
2005-11-30
Research Org.:
Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1092725
Report Number(s):
PNNL-SA-47474
DOE Contract Number:
AC05-76RL01830
Resource Type:
Conference
Resource Relation:
Conference: Proceedings of the 2005 Joint Statistical Meetings, August 7-11, 2005, Minneapolis, Minnesota
Country of Publication:
United States
Language:
English
Subject:
visualization; text analysis; statistics

Citation Formats

Whitney, Paul D. Visual Analysis of Text Document Collections. United States: N. p., 2005. Web.
Whitney, Paul D. (2005). Visual Analysis of Text Document Collections. United States.
Whitney, Paul D. 2005. "Visual Analysis of Text Document Collections". United States.
@article{osti_1092725,
title = {Visual Analysis of Text Document Collections},
author = {Whitney, Paul D.},
abstractNote = {The volume of information, communications, and descriptions provided in text form is large and increasing. One of the most often used software applications of our time, web page retrieval based on keyword descriptions, can be constructed as a text analysis application. The volume and diversity of information available in text data sources have driven the development of a variety of methods for interacting with, and presenting the results from, text analyses. In short, text analysis provides a challenging, important area for statistical analysis and application. Existing text analysis systems and technologies are reviewed. Capabilities of the technology are described, including potential for scaling, analytic activities directly supported, analytic activities that could be supported, and unmet analytic needs. Statistics-related technologies that are contained in text visualization systems are identified. Choices and trade-offs made in text visualization systems are indicated, as are some areas of research and potential development.},
place = {United States},
year = {2005},
month = {11}
}


  • This paper demonstrates the promise of augmenting interactive visualizations with semi-supervised machine learning techniques to improve the discovery of significant associations and insight in the search and analysis of textual information. More specifically, we have developed a system called Gryffin that hosts a unique collection of techniques that facilitate individualized investigative search pertaining to an ever-changing set of analytical questions over an indexed collection of open-source publications related to national infrastructure. The Gryffin client hosts dynamic displays of the search results via focus+context record listings, temporal timelines, term-frequency views, and multiple coordinated views. Furthermore, as the analyst interacts with the display, the interactions are recorded and used to label the search records. These labeled records are then used to drive semi-supervised machine learning algorithms that re-rank the unlabeled search records such that potentially relevant records are moved to the top of the record listing. Gryffin is described in the context of the daily tasks encountered at the Department of Homeland Security's Fusion Centers, with whom we are collaborating in its development. The resulting system is capable of addressing the analysts' information overload, which can be directly attributed to the deluge of information that must be addressed in search and investigative analysis of textual information.
  • The scale, velocity, and dynamic nature of large-scale social media systems like Twitter demand a new set of visual analytics techniques that support near real-time situational awareness. Social media systems are credited with escalating social protest during recent large-scale riots. Virtual communities form rapidly in these online systems, and they occasionally foster violence and unrest, which is conveyed in the users' language. Techniques for analyzing broad trends over these networks or reconstructing conversations within small groups have been demonstrated in recent years, but state-of-the-art tools are inadequate at supporting near real-time analysis of these high-throughput streams of unstructured information. In this paper, we present an adaptive system to discover and interactively explore these virtual networks, as well as detect sentiment, highlight change, and discover spatio-temporal patterns.
  • In this paper, we introduce a new visual analytics system, called Matisse, that allows exploration of global trends in textual information streams with specific application to social media platforms. Despite the potential for real-time situational awareness using these services, interactive analysis of such semi-structured textual information is a challenge due to the high-throughput and high-velocity properties. Matisse addresses these challenges through the following contributions: (1) robust stream data management, (2) automated sentiment/emotion analytics, (3) inferential temporal, geospatial, and term-frequency visualizations, and (4) a flexible drill-down interaction scheme that progresses from macroscale to microscale views. In addition to describing these contributions, our work-in-progress paper concludes with a practical case study focused on the analysis of Twitter 1% sample stream information captured during the week of the Boston Marathon bombings.
  • Visualization tools have been invaluable in the process of scientific discovery by providing researchers with insights gained through graphical tools and techniques. At PNL, the Multidimensional Visualization and Advanced Browsing (MVAB) project is extending visualization technology to the problems of intelligence analysis of textual documents by creating spatial representations of textual information. By representing an entire corpus of documents as points in a coordinate space of two or more dimensions, the tools developed by the MVAB team give the analyst the ability to quickly browse the entire document base and determine relationships among documents and publication patterns not readily discernible through traditional lexical means.
  • Faceted classifications of text collections provide a useful means of partitioning documents into related groups; however, traditional approaches to faceting text collections rely on comprehensive analysis of the subject area or annotated general attributes. In this paper we show the application of basic principles for facet analysis to the development of computational methods for facet classification of text collections. Integration with a visual analytics system is described with summaries of user experiences.