Summary: A New Text Categorization Technique Using
Distributional Clustering and Learning Logic
Hisham Al-Mubaid and Syed A. Umair
Abstract--Text categorization continues to be one of the most researched NLP problems because of the ever-increasing volume of
electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional
clustering of words with a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of text
documents hinders categorization; for this reason, feature clustering has proven to be an effective
alternative to feature selection for reducing the dimensionality. We therefore use a distributional clustering method (IB) to generate an
efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated.
The proposed method achieves higher or comparable classification accuracy and F1 results compared with SVM under identical
experimental settings with a small number of training documents on three benchmark data sets: WebKB, 20Newsgroup, and Reuters-21578.
The results indicate that the method is a good choice for applications with a limited amount of labeled training data. We also
demonstrate the effect of changing the training set size on the classification performance of the learners.
Index Terms--Text categorization, feature selection, machine learning.
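
The dimensionality reduction described in the abstract can be illustrated with a minimal sketch: once words are grouped into clusters, a document is represented by counts per cluster rather than counts per word. The word-to-cluster table, the cluster count, and the function name below are hypothetical and hand-written for illustration only; they are not produced by the IB method or taken from the paper, and Lsquare training is not shown.

```python
from collections import Counter

# Hypothetical word-to-cluster assignment (an assumption for illustration;
# in the paper such clusters would come from distributional clustering).
WORD_TO_CLUSTER = {
    "course": 0, "lecture": 0, "homework": 0,
    "professor": 1, "faculty": 1,
    "research": 2, "publication": 2,
}
NUM_CLUSTERS = 3


def cluster_features(tokens):
    """Map a tokenized document to a vector of per-cluster word counts."""
    # Words without a cluster assignment are ignored.
    counts = Counter(WORD_TO_CLUSTER[t] for t in tokens if t in WORD_TO_CLUSTER)
    return [counts.get(c, 0) for c in range(NUM_CLUSTERS)]


doc = "the professor posted homework for the course lecture".split()
vector = cluster_features(doc)
print(vector)
```

In this toy case seven distinct vocabulary words collapse into three features, which is the sense in which clustering shrinks the representation without discarding words outright, as feature selection would.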
Text categorization (TC) is the task of assigning a given
text document to one or more predefined categories.
This problem has received special and increasing attention
from researchers in the past few decades due to many