DOE Patents, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Automatic generation of stop word lists for information retrieval and analysis

Abstract

Methods and systems for automatically generating lists of stop words for information retrieval and analysis. Generation of the stop words can include providing a corpus of documents and a plurality of keywords. From the corpus of documents, a term list of all terms is constructed and both a keyword adjacency frequency and a keyword frequency are determined. If a ratio of the keyword adjacency frequency to the keyword frequency for a particular term on the term list is less than a predetermined value, then that term is excluded from the term list. The resulting term list is truncated based on predetermined criteria to form a stop word list.
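The procedure in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the tokenizer, the single corpus-wide keyword list, and the default values of `min_ratio` and `max_size` are assumptions made for illustration. The sketch also assumes that a term never seen next to a keyword is dropped, and that the "predetermined criteria" for truncation means keeping the terms with the highest keyword adjacency frequency.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def generate_stop_words(documents, keywords, min_ratio=1.0, max_size=100):
    """Sketch of stop word generation from keyword adjacency statistics.

    documents: list of document strings (the corpus).
    keywords:  list of keyword phrases (assumed to apply corpus-wide).
    min_ratio: assumed threshold on adjacency frequency / keyword frequency.
    max_size:  assumed truncation criterion (keep the top-ranked terms).
    """
    keyword_tokens = [kw for kw in map(tokenize, keywords) if kw]
    adjacency_freq = Counter()  # times a term appears next to a keyword
    keyword_freq = Counter()    # times a term appears inside a keyword
    terms = set()               # term list of all terms in the corpus

    for doc in documents:
        tokens = tokenize(doc)
        terms.update(tokens)
        for kw in keyword_tokens:
            n = len(kw)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] != kw:
                    continue
                keyword_freq.update(kw)
                if i > 0:
                    adjacency_freq[tokens[i - 1]] += 1
                if i + n < len(tokens):
                    adjacency_freq[tokens[i + n]] += 1

    # Exclude terms whose adjacency/keyword ratio falls below the threshold.
    # The comparison is written as a product so a zero keyword frequency
    # needs no division. Dropping terms with zero adjacency is an assumption.
    kept = [
        t for t in terms
        if adjacency_freq[t] > 0
        and adjacency_freq[t] >= min_ratio * keyword_freq[t]
    ]

    # Assumed truncation rule: rank by adjacency frequency, keep the top terms.
    kept.sort(key=lambda t: (-adjacency_freq[t], t))
    return kept[:max_size]


docs = [
    "analysis of linear systems and analysis of linear constraints",
    "the study of linear systems",
]
print(generate_stop_words(docs, ["linear systems", "linear constraints"]))
```

On this toy corpus, "of" and "and" occur next to keywords and never inside them, so they survive the ratio test. "linear", "systems", and "constraints" occur only inside keywords and are excluded.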

Inventors:
Rose, Stuart J.
Issue Date:
January 8, 2013
Research Org.:
Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1082869
Patent Number(s):
8352469
Application Number:
12/555,962
Assignee:
Battelle Memorial Institute (Richland, WA)
Patent Classifications (CPCs):
G - PHYSICS G06 - COMPUTING G06F - ELECTRIC DIGITAL DATA PROCESSING
DOE Contract Number:
AC05-76RL01830
Resource Type:
Patent
Country of Publication:
United States
Language:
English
Subject:
96 KNOWLEDGE MANAGEMENT AND PRESERVATION

Citation Formats

Rose, Stuart J. Automatic generation of stop word lists for information retrieval and analysis. United States: N. p., 2013. Web.
Rose, Stuart J. (2013). Automatic generation of stop word lists for information retrieval and analysis. United States.
Rose, Stuart J. 2013. "Automatic generation of stop word lists for information retrieval and analysis". United States. https://www.osti.gov/servlets/purl/1082869.
@article{osti_1082869,
title = {Automatic generation of stop word lists for information retrieval and analysis},
author = {Rose, Stuart J},
abstractNote = {Methods and systems for automatically generating lists of stop words for information retrieval and analysis. Generation of the stop words can include providing a corpus of documents and a plurality of keywords. From the corpus of documents, a term list of all terms is constructed and both a keyword adjacency frequency and a keyword frequency are determined. If a ratio of the keyword adjacency frequency to the keyword frequency for a particular term on the term list is less than a predetermined value, then that term is excluded from the term list. The resulting term list is truncated based on predetermined criteria to form a stop word list.},
place = {United States},
year = {2013},
month = jan
}

Works referenced in this record:

Automatic keyphrase extraction from scientific documents using N-gram filtration technique
conference, January 2008

A stop list for general text
journal, September 1989

Phrase-Based Hierarchical Clustering of Web Search Results
book, January 2003

ThemeRiver: visualizing thematic changes in large document collections
journal, January 2002

Full-Subtopic Retrieval with Keyphrase-Based Search Results Clustering
conference, September 2009

  • Bernardini, Andrea; Carpineto, Claudio; D'Amico, Massimiliano
  • 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology
  • https://doi.org/10.1109/WI-IAT.2009.37

Extraction of key phrases from document using statistical and linguistic analysis
conference, July 2009

Product placement engine and method
patent, March 2009