OSTI.GOV | U.S. Department of Energy, Office of Scientific and Technical Information

Title: Generative Latent Semantic Analysis: How (Computational) Linguistics Can Aid Information Retrieval.

Abstract

Abstract not provided.

Authors:
Chew, Peter Alexander; Abdelali, Ahmed
Publication Date:
2007-01-01
Research Org.:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1320969
Report Number(s):
SAND2007-0385C
524373
DOE Contract Number:
AC04-94AL85000
Resource Type:
Conference
Resource Relation:
Conference: Proposed for presentation at ACL 2007, held June 23-30, 2007, in Prague, Czech Republic.
Country of Publication:
United States
Language:
English

Citation Formats

Chew, Peter Alexander, and Abdelali, Ahmed. Generative Latent Semantic Analysis: How (Computational) Linguistics Can Aid Information Retrieval. United States: N. p., 2007. Web.
Chew, Peter Alexander, & Abdelali, Ahmed. Generative Latent Semantic Analysis: How (Computational) Linguistics Can Aid Information Retrieval. United States.
Chew, Peter Alexander, and Abdelali, Ahmed. 2007. "Generative Latent Semantic Analysis: How (Computational) Linguistics Can Aid Information Retrieval." United States. https://www.osti.gov/servlets/purl/1320969.
@article{osti_1320969,
title = {Generative Latent Semantic Analysis: How (Computational) Linguistics Can Aid Information Retrieval.},
author = {Chew, Peter Alexander and Abdelali, Ahmed},
abstractNote = {Abstract not provided.},
url = {https://www.osti.gov/servlets/purl/1320969},
place = {United States},
year = {2007},
month = {1}
}

Other availability
Please see Document Availability for additional information on obtaining the full-text document. Library patrons may search WorldCat to identify libraries that hold this conference proceeding.

Similar Records:
  • We describe an entirely statistics-based, unsupervised, and language-independent approach to multilingual information retrieval, which we call Latent Morpho-Semantic Analysis (LMSA). LMSA overcomes some of the shortcomings of related previous approaches such as Latent Semantic Analysis (LSA). LMSA has an important theoretical advantage over LSA: it combines well-known techniques in a novel way to break the terms of LSA down into units which correspond more closely to morphemes. Thus, it has a particular appeal for use with morphologically complex languages such as Arabic. We show through empirical results that the theoretical advantages of LMSA can translate into significant gains in precision in multilingual information retrieval tests. These gains are not matched either when a standard stemmer is used with LSA, or when terms are indiscriminately broken down into n-grams. (A minimal sketch of this pipeline appears after this list.)
  • A technique for information retrieval includes parsing a corpus to identify a number of wordform instances within each document of the corpus. A weighted morpheme-by-document matrix is generated based at least in part on the number of wordform instances within each document of the corpus and based at least in part on a weighting function. The weighted morpheme-by-document matrix separately enumerates instances of stems and affixes. Additionally or alternatively, a term-by-term alignment matrix may be generated based at least in part on the number of wordform instances within each document of the corpus. At least one lower-rank approximation matrix is generated by factorizing the weighted morpheme-by-document matrix and/or the term-by-term alignment matrix. (The second sketch after this list shows one way such an alignment matrix might be formed.)
  • Latent Semantic Analysis (LSA) is based on the Singular Value Decomposition (SVD) of a term-by-document matrix for identifying relationships among terms and documents from co-occurrence patterns. Among the multiple ways of computing the SVD of a rectangular matrix X, one approach is to compute the eigenvalue decomposition (EVD) of a square 2 x 2 composite matrix consisting of four blocks, with X and X^T in the off-diagonal blocks and zero matrices in the diagonal blocks. We point out that significant value can be added to LSA by filling in some of the values in the diagonal blocks (corresponding to explicit term-to-term or document-to-document associations) and computing a term-by-concept matrix from the EVD. For the case of multilingual LSA, we incorporate information on cross-language term alignments of the same sort used in Statistical Machine Translation (SMT). Since all elements of the proposed EVD-based approach can rely entirely on lexical statistics, hardly any price is paid for the improved empirical results. In particular, the approach, like LSA or SMT, can still be generalized to virtually any language(s); computation of the EVD takes similar resources to that of the SVD since all the blocks are sparse; and the results of EVD are just as economical as those of SVD. (The third sketch after this list illustrates the block-matrix EVD.)
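
The LMSA abstract above describes a concrete pipeline: segment terms into morpheme-like units, weight a morpheme-by-document matrix, and reduce it with SVD. Below is a minimal sketch of that pipeline in Python. The `segment` function is a toy stand-in (the actual unsupervised segmentation is not specified in the abstract), and the log-entropy weighting is the scheme conventionally paired with LSA, assumed here rather than confirmed by the source.

```python
import numpy as np

def segment(token):
    # Toy stand-in for unsupervised morpheme induction: split off a few
    # common English suffixes. A real LMSA system would learn these units.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return [token[:-len(suffix)], "+" + suffix]
    return [token]

def morpheme_doc_matrix(docs):
    # Count morpheme occurrences per document; stems and "+suffix"
    # affixes are enumerated as separate rows.
    vocab, pairs = {}, []
    for d, text in enumerate(docs):
        for tok in text.lower().split():
            for m in segment(tok):
                pairs.append((vocab.setdefault(m, len(vocab)), d))
    X = np.zeros((len(vocab), len(docs)))
    for j, d in pairs:
        X[j, d] += 1
    return X, vocab

def log_entropy(X):
    # Log-entropy weighting commonly used with LSA (an assumption here).
    p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = 1 + np.nansum(p * np.log(p), axis=1) / np.log(X.shape[1])
    return np.log1p(X) * ent[:, None]

docs = ["the dogs were running fast",
        "a dog runs in the park",
        "stock markets dropped sharply",
        "the market dropped again"]
X, vocab = morpheme_doc_matrix(docs)
W = log_entropy(X)

# Truncated SVD: rows of Vt.T, scaled by the singular values, are
# document vectors in the reduced latent space.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T * S[:k]
print(np.round(doc_vecs, 3))
```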
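The patent-style abstract above leaves the construction of the term-by-term alignment matrix open. The following speculative sketch shows one plausible construction from wordform counts over aligned document pairs; the matrices, names, and normalization are illustrative assumptions, not the technique as filed.

```python
import numpy as np

# Toy parallel corpus: X_src[i, d] and X_tgt[j, d] count wordform
# instances of source term i / target term j in aligned document d.
X_src = np.array([[2, 0, 1],
                  [0, 3, 0]], dtype=float)
X_tgt = np.array([[1, 0, 2],
                  [0, 2, 0],
                  [1, 1, 0]], dtype=float)

# Co-occurrence across aligned documents as the alignment signal:
# M[i, j] is large when source term i and target term j tend to
# appear in the same aligned document pairs.
M = X_src @ X_tgt.T

# Row-normalize so each source term spreads unit mass over its
# candidate translations.
M = M / np.maximum(M.sum(axis=1, keepdims=True), 1e-12)
print(np.round(M, 2))
```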
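The last abstract describes computing an EVD of the symmetric composite matrix [[A, X], [X^T, B]], where the diagonal blocks A and B hold explicit term-to-term and document-to-document associations. The sketch below shows that construction on toy data; the specific association values are invented for illustration, not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2                  # terms, documents, latent dimensions
X = rng.random((m, n))             # (weighted) term-by-document matrix
A = np.zeros((m, m))               # term-to-term block (zero = plain SVD)
A[0, 1] = A[1, 0] = 5.0            # e.g. terms 0 and 1 are aligned translations
B = np.zeros((n, n))               # document-to-document block

# Composite matrix: [[A, X], [X^T, B]]
Z = np.block([[A, X], [X.T, B]])

# Symmetric EVD; sort eigenpairs by descending eigenvalue.
w, V = np.linalg.eigh(Z)
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]

term_by_concept = V[:m, :k]        # first m rows: term embeddings
doc_by_concept = V[m:, :k]         # last n rows: document embeddings

# With A = B = 0, the eigenvalues of [[0, X], [X^T, 0]] are plus/minus
# the singular values of X, so this recovers LSA's SVD up to scaling;
# nonzero diagonal blocks inject the explicit associations.
print(np.round(w[:k], 3))
```

Since Z is symmetric and all four blocks stay sparse in practice, the EVD costs about as much as the SVD it generalizes, which is the resource claim made in the abstract.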