Title: Automatic generation of stop word lists for information retrieval and analysis

Patent · OSTI ID: 1082869

Methods and systems are described for automatically generating lists of stop words for information retrieval and analysis. Generating the stop words can include providing a corpus of documents and a plurality of keywords. From the corpus, a term list of all terms is constructed, and a keyword adjacency frequency and a keyword frequency are determined for each term. If the ratio of the keyword adjacency frequency to the keyword frequency for a particular term on the term list is less than a predetermined value, that term is excluded from the term list. The resulting term list is then truncated based on predetermined criteria to form a stop word list.
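
The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The abstract does not define the two frequencies or the truncation criteria, so the sketch assumes that keyword frequency counts occurrences of a term inside a supplied keyword, that keyword adjacency frequency counts occurrences of a term directly next to a keyword, and that truncation keeps the most frequent surviving terms. The names `tokenize`, `generate_stop_words`, `min_ratio`, and `max_words` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def generate_stop_words(documents, keywords, min_ratio=1.0, max_words=100):
    """Return a stop word list for a corpus with per-document keywords.

    documents: list of document strings.
    keywords:  list of keyword lists, one list per document.
    min_ratio: assumed "predetermined value" for the ratio test.
    max_words: assumed truncation criterion (keep the top-N terms).
    """
    term_freq = Counter()       # occurrences of each term in the corpus
    adjacency_freq = Counter()  # occurrences directly next to a keyword
    keyword_freq = Counter()    # occurrences inside a keyword

    for text, doc_keywords in zip(documents, keywords):
        tokens = tokenize(text)
        term_freq.update(tokens)

        # Mark every token position covered by a keyword occurrence.
        in_keyword = [False] * len(tokens)
        for keyword in doc_keywords:
            kw_tokens = tokenize(keyword)
            n = len(kw_tokens)
            if n == 0:
                continue
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == kw_tokens:
                    in_keyword[i:i + n] = [True] * n

        for i, token in enumerate(tokens):
            if in_keyword[i]:
                keyword_freq[token] += 1
            elif (i > 0 and in_keyword[i - 1]) or (
                i + 1 < len(tokens) and in_keyword[i + 1]
            ):
                adjacency_freq[token] += 1

    # Ratio test, cross-multiplied to avoid dividing by zero: a term is
    # excluded when adjacency_freq / keyword_freq < min_ratio. Terms with
    # both counts at zero pass this test (an assumption of this sketch)
    # and are left to the truncation step.
    kept = [
        term for term in term_freq
        if not adjacency_freq[term] < min_ratio * keyword_freq[term]
    ]

    # Assumed truncation: most frequent terms first, ties alphabetical.
    kept.sort(key=lambda term: (-term_freq[term], term))
    return kept[:max_words]
```

For the two documents "the analysis of the text corpus" (keywords "analysis" and "text corpus") and "a model of the corpus" (keywords "model" and "corpus"), calling `generate_stop_words` with `max_words=2` returns `['the', 'of']`: both terms occur only next to keywords, while "corpus", "analysis", "text", and "model" occur only inside keywords and fail the ratio test.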

Research Organization: Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Organization: USDOE
DOE Contract Number: AC0576RL01830
Assignee: Battelle Memorial Institute (Richland, WA)
Patent Number(s): 8,352,469
Application Number: 12/555,962
Country of Publication: United States
Language: English
