OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information

Abstract

We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams.
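The abstract names character n-grams and mutual information as the ingredients of LMSA but does not spell out how they are combined. As a rough illustration only, the sketch below enumerates the character n-grams of each word in a small word list and computes the pointwise mutual information, PMI(a, b) = log P(ab) - log P(a) - log P(b), for every two-way split of a word. The word list, the function names, and the choice to estimate all probabilities from one shared n-gram table are assumptions made for this example. They are not taken from the paper, and the sketch does not claim to reproduce the authors' tokenization or their selection rule.

```python
import math
from collections import Counter


def char_ngrams(word):
    """Return every contiguous character substring of `word`."""
    return [word[i:j]
            for i in range(len(word))
            for j in range(i + 1, len(word) + 1)]


def ngram_probabilities(words):
    """Relative frequency of each character n-gram over a word list."""
    counts = Counter(g for w in words for g in char_ngrams(w))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def split_pmi(word, probs):
    """Score each two-way split of `word` by pointwise mutual information.

    `word` must occur in the list used to build `probs`; otherwise the
    lookup of probs[word] raises KeyError.
    """
    scores = []
    for k in range(1, len(word)):
        a, b = word[:k], word[k:]
        pmi = math.log(probs[word]) - math.log(probs[a]) - math.log(probs[b])
        scores.append((a, b, pmi))
    return scores


# Hypothetical toy vocabulary, not data from the paper.
toy_words = ["walk", "walks", "walked", "talk", "talks", "talked"]
probs = ngram_probabilities(toy_words)

for a, b, pmi in split_pmi("walked", probs):
    print(f"{a}+{b}: {pmi:.3f}")
```

How such scores would be turned into a single segmentation per term, and how the resulting units would replace whole words as the rows of the LSA term-by-document matrix, is left open here because the abstract does not specify it.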

Authors:
Bader, Brett William; Chew, Peter A; Abdelali, Ahmed [1]
  1. New Mexico State University
Publication Date:
2008-08
Research Org.:
Sandia National Laboratories
Sponsoring Org.:
USDOE
OSTI Identifier:
947254
Report Number(s):
SAND2008-5395C
TRN: US200909%%6
DOE Contract Number:  
AC04-94AL85000
Resource Type:
Conference
Resource Relation:
Conference: Proposed for presentation at the 22nd International Conference on Computational Linguistics held August 16-24, 2008 in Manchester, UK.
Country of Publication:
United States
Language:
English
Subject:
99 GENERAL AND MISCELLANEOUS//MATHEMATICS, COMPUTING, AND INFORMATION SCIENCE; ACCURACY; INFORMATION RETRIEVAL; STANDARDIZED TERMINOLOGY; MACHINE TRANSLATIONS

Citation Formats

Bader, Brett William, Chew, Peter A, and Abdelali, Ahmed. Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information. United States: N. p., 2008. Web.
Bader, Brett William, Chew, Peter A, & Abdelali, Ahmed. Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information. United States.
Bader, Brett William, Chew, Peter A, and Abdelali, Ahmed. 2008. "Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information." United States.
@article{osti_947254,
title = {Latent morpho-semantic analysis: multilingual information retrieval with character n-grams and mutual information},
author = {Bader, Brett William and Chew, Peter A and Abdelali, Ahmed},
abstractNote = {We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams.},
url = {https://www.osti.gov/biblio/947254},
place = {United States},
year = {2008},
month = {8}
}

Conference:
Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.
