U.S. Department of Energy
Office of Scientific and Technical Information

Experiments in automatic word class and word sense identification for information retrieval

Technical Report · OSTI ID:68594
 [1];  [2]
  1. Univ. of Kansas, Lawrence, KS (United States)
  2. Northeastern Univ., Lawrence, KS (United States)

Automatic identification of related words and automatic detection of word senses are two long-standing goals of researchers in natural language processing. Word class information and word sense identification may enhance the performance of information retrieval systems. Large online corpora and increased computational capabilities make new techniques based on corpus linguistics feasible. Corpus-based analysis is especially needed for corpora from specialized fields for which no electronic dictionaries or thesauri exist. The methods described here use a combination of mutual information and word context to establish word similarities. Then, unsupervised classification is done using clustering in the word space, identifying word classes without pretagging. We also describe an extension of the method to handle the difficult problems of disambiguation and of determining part-of-speech and semantic information for low-frequency words. The method is powerful enough to produce high-quality results on a small corpus of 200,000 words from abstracts in a field of molecular biology.
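
The general idea in the abstract (mutual information over word contexts, followed by unsupervised clustering) can be illustrated with a minimal sketch. This is not the report's implementation: the record gives no algorithmic detail, so the positive pointwise mutual information weighting, the one-word context window, the cosine similarity measure, the single-link threshold clustering, and the toy corpus are all assumptions chosen for illustration, as are the function names.

```python
import math
from collections import Counter, defaultdict


def pmi_vectors(sentences, window=1):
    """Map each word to a sparse vector of positive PMI scores over its neighbours.

    Assumption: contexts are the words within `window` positions on either side.
    """
    pair = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pair[(w, tokens[j])] += 1
    total = sum(pair.values())
    # Pair counts are symmetric, so one marginal serves for words and contexts.
    marginal = Counter()
    for (w, _), n in pair.items():
        marginal[w] += n
    vectors = defaultdict(dict)
    for (w, c), n in pair.items():
        score = math.log(n * total / (marginal[w] * marginal[c]))
        if score > 0:  # keep only positive associations
            vectors[w][c] = score
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Single-link clustering: join any two words whose similarity reaches threshold."""
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)
    groups = defaultdict(list)
    for w in words:
        groups[find(w)].append(w)
    return sorted(groups.values())


# Hypothetical toy corpus; the report used about 200,000 words of abstracts.
toy_corpus = [
    ["the", "protein", "binds", "dna"],
    ["the", "enzyme", "binds", "dna"],
    ["the", "protein", "cleaves", "rna"],
    ["the", "enzyme", "cleaves", "rna"],
]
vectors = pmi_vectors(toy_corpus)
clusters = cluster(vectors)
```

In this toy corpus "protein" and "enzyme" occur in identical contexts, so they receive identical vectors and are grouped into one class without any tagging. A real corpus would need a wider window, frequency cutoffs, and a more careful clustering criterion.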

Research Organization:
Nevada Univ., Las Vegas, NV (United States)
OSTI ID:
68594
Report Number(s):
CONF-9404212--
Country of Publication:
United States
Language:
English

Similar Records

LEARNING SEMANTICS-ENHANCED LANGUAGE MODELS APPLIED TO UNSUPERVISED WSD
Conference · Sun Jan 28 23:00:00 EST 2007 · OSTI ID:985889

Word prediction
Technical Report · Mon May 01 00:00:00 EDT 1995 · OSTI ID:123254

Word Domain Disambiguation via Word Sense Disambiguation
Conference · Sun Jun 04 00:00:00 EDT 2006 · OSTI ID:908504