U.S. Department of Energy
Office of Scientific and Technical Information

Characterizing Large Text Corpora Using a Maximum Variation Sampling Genetic Algorithm

Conference

An enormous amount of information is available via the Internet. Much of this data is in the form of text-based documents. These documents cover a variety of topics that are vitally important to the scientific, business, and defense/security communities. Currently, there are many techniques for processing and analyzing such data. However, quickly characterizing a large set of documents remains challenging. Previous work has successfully demonstrated the use of a genetic algorithm to provide a representative subset of text documents via adaptive sampling. In this work, we expand and explore this approach on much larger data sets using a parallel Genetic Algorithm (GA) with adaptive parameter control. Experimental results are presented and discussed.
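
As a rough illustration of the idea in the abstract, the following minimal sketch shows how a genetic algorithm might pick a fixed-size subset of documents that are as dissimilar from one another as possible. It is not the authors' implementation: the Jaccard distance on token sets, the sum-of-pairwise-distances fitness, the crossover and mutation operators, and every name and parameter value (`max_variation_sample`, `pop_size`, `mutation_rate`, and so on) are assumptions made for this example. The parallel execution and adaptive parameter control mentioned in the abstract are not modeled here.

```python
import random
from itertools import combinations


def jaccard_distance(a, b):
    """Dissimilarity between two token sets (an assumed metric, not from the record)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def fitness(subset, docs):
    """Assumed fitness: sum of pairwise distances among the sampled documents."""
    return sum(jaccard_distance(docs[i], docs[j]) for i, j in combinations(subset, 2))


def crossover(p1, p2, k, rng):
    """Child draws k distinct document indices from the union of both parents."""
    pool = sorted(set(p1) | set(p2))
    return tuple(sorted(rng.sample(pool, k)))


def mutate(subset, n_docs, rate, rng):
    """With probability `rate`, swap one sampled document for an unsampled one."""
    if rng.random() >= rate:
        return subset
    outside = [i for i in range(n_docs) if i not in subset]
    if not outside:
        return subset
    members = list(subset)
    members[rng.randrange(len(members))] = rng.choice(outside)
    return tuple(sorted(members))


def max_variation_sample(docs, k, pop_size=30, generations=60,
                         mutation_rate=0.3, seed=0):
    """Evolve index subsets of size k toward high internal dissimilarity."""
    rng = random.Random(seed)
    n = len(docs)
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]

    def score(subset):
        return fitness(subset, docs)

    for _ in range(generations):
        # Elitism: the best individual always survives into the next generation.
        children = [max(population, key=score)]
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            child = crossover(p1, p2, k, rng)
            children.append(mutate(child, n, mutation_rate, rng))
        population = children
    return max(population, key=score)
```

Calling `max_variation_sample(docs, k=3)` on a list of token sets returns a sorted tuple of three document indices. Scoring every pair in a subset costs time quadratic in `k`, which is one reason a much larger corpus would likely call for the kind of parallel evaluation the abstract describes.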

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
ORNL work for others
DOE Contract Number:
AC05-00OR22725
OSTI ID:
931452
Country of Publication:
United States
Language:
English

Similar Records

GPU-Accelerated Text Mining
Conference · Wed Dec 31 23:00:00 EST 2008 · OSTI ID:962625

DNA sequence assembly and genetic algorithms new results and puzzling insights
Technical Report · Sat Dec 30 23:00:00 EST 1995 · OSTI ID:401855

Feature Subset Selection, Class Separability, and Genetic Algorithms
Conference · Tue Jan 20 23:00:00 EST 2004 · OSTI ID:15013963
