ParaText : scalable text modeling and analysis.
Abstract
Automated processing, modeling, and analysis of unstructured text (news documents, web content, journal articles, etc.) is a key task in many data analysis and decision making applications. As data sizes grow, scalability is essential for deep analysis. In many cases, documents are modeled as term or feature vectors and latent semantic analysis (LSA) is used to model latent, or hidden, relationships between documents and terms appearing in those documents. LSA supplies conceptual organization and analysis of document collections by modeling high-dimension feature vectors in many fewer dimensions. While past work on the scalability of LSA modeling has focused on the SVD, the goal of our work is to investigate the use of distributed memory architectures for the entire text analysis process, from data ingestion to semantic modeling and analysis. ParaText is a set of software components for distributed processing, modeling, and analysis of unstructured text. The ParaText source code is available under a BSD license, as an integral part of the Titan toolkit. ParaText components are chained-together into data-parallel pipelines that are replicated across processes on distributed-memory architectures. Individual components can be replaced or rewired to explore different computational strategies and implement new functionality. ParaText functionality can be embedded in applications on any platform using the native C++ API, Python, or Java. The ParaText MPI Process provides a 'generic' text analysis pipeline in a command-line executable that can be used for many serial and parallel analysis tasks. ParaText can also be deployed as a web service accessible via a RESTful (HTTP) API. In the web service configuration, any client can access the functionality provided by ParaText using commodity protocols ... from standard web browsers to custom clients written in any language.
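The abstract's description of LSA (documents modeled as term vectors, then represented in many fewer dimensions) can be illustrated with a small serial sketch. This is not ParaText code and does not use the ParaText or Titan APIs; it assumes only NumPy, and the function names, the whitespace tokenizer, and the three toy documents are hypothetical choices made for illustration.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Count-based term-document matrix: one row per term, one column per document."""
    # Hypothetical tokenizer: lowercase and split on whitespace.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, vocab


def lsa_document_vectors(A, k):
    """Represent each document in a k-dimensional latent space via a truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular values; each row of the result is one document.
    return (np.diag(s[:k]) @ Vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy corpus (illustrative only).
docs = [
    "parallel text analysis pipeline",
    "distributed text analysis pipeline",
    "latent semantic model",
]
A, vocab = build_term_document_matrix(docs)
doc_vecs = lsa_document_vectors(A, k=2)
```

In this toy corpus the first two documents share most of their terms, so their latent vectors should be far more similar to each other than either is to the third. A distributed implementation of the kind the abstract describes would partition ingestion and the SVD across processes, which this sketch does not attempt.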
- Authors:
- Dunlavy, Daniel M; Stanton, Eric T; Shead, Timothy M
- Publication Date:
- 2010-06-01
- Research Org.:
- Sandia National Laboratories (SNL), Albuquerque, NM, and Livermore, CA (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1020434
- Report Number(s):
- SAND2010-3682C
TRN: US201116%%326
- DOE Contract Number:
- AC04-94AL85000
- Resource Type:
- Conference
- Resource Relation:
- Conference: Proposed for presentation at the 2010 ACM International Symposium on High Performance Distributed Computing (HPDC) held June 21-25, 2010 in Chicago, IL.
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; COMPUTERS; CONFIGURATION; DATA ANALYSIS; DECISION MAKING; DIMENSIONS; INGESTION; JAVA; PERFORMANCE; PIPELINES; PROCESSING; SIMULATION; VECTORS
Citation Formats
Dunlavy, Daniel M, Stanton, Eric T, and Shead, Timothy M. ParaText : scalable text modeling and analysis. United States: N. p., 2010. Web.
Dunlavy, Daniel M, Stanton, Eric T, & Shead, Timothy M. (2010). ParaText : scalable text modeling and analysis. United States.
Dunlavy, Daniel M, Stanton, Eric T, and Shead, Timothy M. 2010. "ParaText : scalable text modeling and analysis." United States.
@article{osti_1020434,
title = {ParaText : scalable text modeling and analysis.},
author = {Dunlavy, Daniel M and Stanton, Eric T and Shead, Timothy M},
abstractNote = {Automated processing, modeling, and analysis of unstructured text (news documents, web content, journal articles, etc.) is a key task in many data analysis and decision making applications. As data sizes grow, scalability is essential for deep analysis. In many cases, documents are modeled as term or feature vectors and latent semantic analysis (LSA) is used to model latent, or hidden, relationships between documents and terms appearing in those documents. LSA supplies conceptual organization and analysis of document collections by modeling high-dimension feature vectors in many fewer dimensions. While past work on the scalability of LSA modeling has focused on the SVD, the goal of our work is to investigate the use of distributed memory architectures for the entire text analysis process, from data ingestion to semantic modeling and analysis. ParaText is a set of software components for distributed processing, modeling, and analysis of unstructured text. The ParaText source code is available under a BSD license, as an integral part of the Titan toolkit. ParaText components are chained-together into data-parallel pipelines that are replicated across processes on distributed-memory architectures. Individual components can be replaced or rewired to explore different computational strategies and implement new functionality. ParaText functionality can be embedded in applications on any platform using the native C++ API, Python, or Java. The ParaText MPI Process provides a 'generic' text analysis pipeline in a command-line executable that can be used for many serial and parallel analysis tasks. ParaText can also be deployed as a web service accessible via a RESTful (HTTP) API. In the web service configuration, any client can access the functionality provided by ParaText using commodity protocols ... from standard web browsers to custom clients written in any language.},
doi = {},
url = {https://www.osti.gov/biblio/1020434},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2010},
month = jun
}