skip to main content
OSTI.GOV title logo U.S. Department of Energy
Office of Scientific and Technical Information

Title: Information Gain Based Dimensionality Selection for Classifying Text Documents

Conference ·

Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel, genetic algorithm based methodology, for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.

Research Organization:
Idaho National Lab. (INL), Idaho Falls, ID (United States)
Sponsoring Organization:
USDOE
DOE Contract Number:
DE-AC07-05ID14517
OSTI ID:
1097139
Report Number(s):
INL/CON-13-28691
Resource Relation:
Conference: IEEE Congress on Evolutionary Computation ,Cancun, Mexico,06/20/2013,06/23/2013
Country of Publication:
United States
Language:
English