Title: Experiments in automatic word class and word sense identification for information retrieval

Technical Report · OSTI ID: 68594
 [1];  [2]
  1. Univ. of Kansas, Lawrence, KS (United States)
  2. Northeastern Univ., Boston, MA (United States)

Automatic identification of related words and automatic detection of word senses are two long-standing goals of researchers in natural language processing. Word class information and word sense identification may enhance the performance of information retrieval systems. Large online corpora and increased computational capabilities make new techniques based on corpus linguistics feasible. Corpus-based analysis is especially needed for corpora from specialized fields for which no electronic dictionaries or thesauri exist. The methods described here use a combination of mutual information and word context to establish word similarities. Then, unsupervised classification is done using clustering in the word space, identifying word classes without pretagging. We also describe an extension of the method to handle the difficult problems of disambiguation and of determining part-of-speech and semantic information for low-frequency words. The method is powerful enough to produce high-quality results on a small corpus of 200,000 words from abstracts in a field of molecular biology.
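
The pipeline the abstract outlines (mutual information over word contexts, then unsupervised clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the formulas, so the use of positive pointwise mutual information, cosine similarity, average-link agglomerative clustering, the window size, the threshold, and the function names `context_vectors`, `cosine`, and `cluster` are all assumptions made for illustration, and the toy corpus is invented.

```python
import math
from collections import Counter


def context_vectors(tokens, window=2):
    """Map each word to a sparse vector of positive PMI scores with nearby words.

    Assumption: PMI over a symmetric window stands in for the abstract's
    "mutual information and word context"; the report may define it differently.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(word, tokens[j])] += 1
    total_words = len(tokens)
    total_pairs = sum(pair_counts.values())
    vectors = {word: {} for word in word_counts}
    for (word, ctx), n in pair_counts.items():
        p_pair = n / total_pairs
        p_word = word_counts[word] / total_words
        p_ctx = word_counts[ctx] / total_words
        pmi = math.log2(p_pair / (p_word * p_ctx))
        if pmi > 0:  # keep only positive associations
            vectors[word][ctx] = pmi
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.5):
    """Average-link agglomerative clustering; stops below the similarity threshold."""
    clusters = [[word] for word in sorted(vectors)]

    def link(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, (i, j) = max(
            (link(a, b), (i, j))
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if best < threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


# Invented toy corpus; the report itself used about 200,000 words of abstracts.
tokens = (
    "the cat eats food . the dog eats food . "
    "a gene encodes protein . a locus encodes protein ."
).split()
vectors = context_vectors(tokens)
targets = {w: vectors[w] for w in ("cat", "dog", "gene", "locus")}
classes = cluster(targets)
print(classes)
```

In this sketch, words that occur in similar contexts end up with similar vectors and are merged into the same class without any pretagging, which is the general idea the abstract describes; a realistic corpus would need frequency cutoffs and a more careful choice of threshold.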

Research Organization:
Nevada Univ., Las Vegas, NV (United States)
Report Number(s):
CONF-9404212-; TRN: 95:004349-0034
Resource Relation:
Conference: 3. annual symposium on document analysis and information retrieval, Las Vegas, NV (United States), 11-13 Apr 1994; Other Information: PBD: 1994; Related Information: Is Part Of Third Annual Symposium on Document Analysis and Information Retrieval; PB: 484 p.
Country of Publication:
United States
Language:
English