OSTI.GOV
U.S. Department of Energy
Office of Scientific and Technical Information

Title: Characterizing Large Text Corpora Using a Maximum Variation Sampling Genetic Algorithm

Abstract

An enormous amount of information is available via the Internet, much of it in the form of text-based documents. These documents cover a variety of topics that are vitally important to the scientific, business, and defense/security communities. Currently, there are many techniques for processing and analyzing such data. However, quickly characterizing a large set of documents remains challenging. Previous work successfully demonstrated the use of a genetic algorithm to provide a representative subset of text documents via adaptive sampling. In this work, we extend and explore this approach on much larger data sets using a parallel genetic algorithm (GA) with adaptive parameter control. Experimental results are presented and discussed.
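
The abstract does not specify the fitness function, the genetic operators, or the parallel and adaptive parameter scheme. As a rough illustration of the general idea only, the following serial sketch evolves fixed-size subsets of document vectors and scores each subset by its total pairwise cosine distance. Every name here (`max_variation_sample`, `fitness`, `crossover`, `mutate`), the distance measure, and all parameter values are assumptions made for illustration, not the author's method.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def fitness(sample, vectors):
    """Assumed fitness: total pairwise distance within the sample."""
    return sum(
        cosine_distance(vectors[i], vectors[j])
        for n, i in enumerate(sample)
        for j in sample[n + 1:]
    )


def crossover(p1, p2, k, rng):
    """Child draws k distinct document indices from the union of both parents."""
    pool = sorted(set(p1) | set(p2))
    return rng.sample(pool, k)


def mutate(sample, n_docs, rate, rng):
    """With probability `rate`, swap each index for one not yet in the sample."""
    result = list(sample)
    for pos in range(len(result)):
        if rng.random() < rate:
            unused = [d for d in range(n_docs) if d not in result]
            if unused:
                result[pos] = rng.choice(unused)
    return result


def max_variation_sample(vectors, k, pop_size=30, generations=100,
                         rate=0.1, seed=0):
    """Evolve a k-document subset with high internal variation (serial sketch)."""
    rng = random.Random(seed)
    n = len(vectors)
    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with offspring.
        population.sort(key=lambda s: fitness(s, vectors), reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2, k, rng), n, rate, rng))
        population = elite + children
    return sorted(max(population, key=lambda s: fitness(s, vectors)))


# Toy data: three pairs of near-duplicate "documents" as term vectors.
docs = [[1, 0, 0], [1, 0.01, 0], [0, 1, 0], [0, 1, 0.01], [0, 0, 1], [0.01, 0, 1]]
chosen = max_variation_sample(docs, k=3)
print(chosen, round(fitness(chosen, docs), 3))
```

On the toy data, a sample that takes one document from each near-duplicate pair scores close to 3, while a sample containing both members of a pair scores close to 2, so the search is pushed toward subsets that cover all three groups.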

Authors:
Patton, Robert M. [1]
  1. ORNL
Publication Date:
2006
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
Work for Others (WFO)
OSTI Identifier:
931452
DOE Contract Number:
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: Genetic and Evolutionary Computation Conference, Seattle, WA, USA, July 8-12, 2006
Country of Publication:
United States
Language:
English

Citation Formats

Patton, Robert M. Characterizing Large Text Corpora Using a Maximum Variation Sampling Genetic Algorithm. United States: N. p., 2006. Web. doi:10.1145/1143997.1144308.
Patton, Robert M. (2006). Characterizing Large Text Corpora Using a Maximum Variation Sampling Genetic Algorithm. United States. doi:10.1145/1143997.1144308.
Patton, Robert M. 2006. "Characterizing Large Text Corpora Using a Maximum Variation Sampling Genetic Algorithm". United States. doi:10.1145/1143997.1144308.
@article{osti_931452,
title = {Characterizing Large Text Corpora Using a Maximum Variation Sampling Genetic Algorithm},
author = {Patton, Robert M},
abstractNote = {An enormous amount of information is available via the Internet, much of it in the form of text-based documents. These documents cover a variety of topics that are vitally important to the scientific, business, and defense/security communities. Currently, there are many techniques for processing and analyzing such data. However, quickly characterizing a large set of documents remains challenging. Previous work successfully demonstrated the use of a genetic algorithm to provide a representative subset of text documents via adaptive sampling. In this work, we extend and explore this approach on much larger data sets using a parallel genetic algorithm (GA) with adaptive parameter control. Experimental results are presented and discussed.},
doi = {10.1145/1143997.1144308},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2006},
month = {1}
}


  • From a management perspective, understanding the information that exists on a network and how it is distributed provides a critical advantage. This work explores the use of topic modeling as an approach to automatically determine the classes of information that exist on an organization's network, and then use the resultant topics as centroid vectors for the classification of individual documents in order to understand the distribution of information topics across the enterprise network. The approach is tested using the 20 Newsgroups dataset.
  • This paper reports on the results of applying a genetic algorithm to the optimization of a large UK coal mine ventilation network. The genetic algorithm technique has been developed into a computer program for minimizing the total network operating fan power costs. The application of booster fans may become an attractive alternative for ventilation engineers to provide an adequate supply of fresh air around the working areas in some deep and/or extensive mines. The objective of this research is to minimize the total power consumption of a ventilation system by determining the optimum combinations of (1) main fan and booster fan ratings and (2) booster fan position(s). A modular computer program, which combines the genetic algorithm optimization technique with a ventilation network simulator, has been developed using the C++ language. The ventilation network simulator uses the standard Hardy Cross iterative scheme implicit within the VNET mine ventilation software that was developed at the University of Nottingham. This paper presents details of a study on an extensive UK coal mine ventilation network. The ventilation of this network is investigated using various configurations: a single main surface fan, or a main surface fan with one, two, or three underground booster fans. The paper highlights the major genetic operators that are used to evolve the optimum solution. It is concluded that the genetic algorithm approach is an efficient and flexible solution method.