Automatic generation of stop word lists for information retrieval and analysis
Abstract
Methods and systems for automatically generating lists of stop words for information retrieval and analysis. Generation of the stop words can include providing a corpus of documents and a plurality of keywords. From the corpus of documents, a term list of all terms is constructed and both a keyword adjacency frequency and a keyword frequency are determined. If a ratio of the keyword adjacency frequency to the keyword frequency for a particular term on the term list is less than a predetermined value, then that term is excluded from the term list. The resulting term list is truncated based on predetermined criteria to form a stop word list.
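The steps in the abstract can be sketched in Python as follows. This is one possible reading, not the patented implementation. The abstract does not define its two frequencies, so the sketch assumes that keyword frequency counts how often a term occurs inside a keyword occurrence and that keyword adjacency frequency counts how often a term occurs immediately before or after one. The names `tokenize` and `generate_stop_words`, the default threshold `min_ratio=1.0`, and the truncation criterion (keep the `max_size` terms with the highest adjacency frequency) are likewise hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def generate_stop_words(documents, min_ratio=1.0, max_size=10):
    """Build a stop word list from (text, keywords) pairs.

    Assumed definitions (not given in the abstract):
      keyword frequency           = occurrences of a term inside a keyword
      keyword adjacency frequency = occurrences of a term directly before
                                    or after a keyword occurrence
    """
    adjacency = Counter()
    keyword_freq = Counter()
    terms = set()

    for text, keywords in documents:
        tokens = tokenize(text)
        terms.update(tokens)  # term list of all terms in the corpus
        phrases = {tuple(tokenize(k)) for k in keywords}
        phrases.discard(())  # ignore keywords that tokenize to nothing
        for phrase in phrases:
            n = len(phrase)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) != phrase:
                    continue
                keyword_freq.update(phrase)
                if i > 0:
                    adjacency[tokens[i - 1]] += 1
                if i + n < len(tokens):
                    adjacency[tokens[i + n]] += 1

    # Exclude a term when adjacency / keyword_freq < min_ratio. The test is
    # cross-multiplied so that terms never seen inside a keyword
    # (keyword_freq == 0) are kept instead of causing a division by zero.
    kept = [t for t in terms if adjacency[t] >= min_ratio * keyword_freq[t]]

    # Assumed truncation criterion: highest adjacency frequency first,
    # ties broken alphabetically, cut to max_size entries.
    kept.sort(key=lambda t: (-adjacency[t], t))
    return kept[:max_size]


sample_docs = [
    ("the analysis of the linear constraints and the natural numbers",
     ["linear constraints", "natural numbers"]),
    ("a system of linear constraints over the set of natural numbers",
     ["linear constraints", "natural numbers"]),
]
stop_words = generate_stop_words(sample_docs, max_size=4)
```

In this toy corpus the words that appear only inside keywords ("linear", "constraints", "natural", "numbers") are dropped by the ratio test, while function words that border the keywords rank at the top of the list. A real corpus would need a larger `max_size` and possibly a different ranking criterion.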
- Inventors:
- Rose, Stuart J.
- Issue Date:
- 2013-01-08
- Research Org.:
- Pacific Northwest National Laboratory (PNNL), Richland, WA (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1082869
- Patent Number(s):
- 8352469
- Application Number:
- 12/555,962
- Assignee:
- Battelle Memorial Institute (Richland, WA)
- Patent Classifications (CPCs):
- G - PHYSICS; G06 - COMPUTING; G06F - ELECTRIC DIGITAL DATA PROCESSING
- DOE Contract Number:
- AC0576RL01830
- Resource Type:
- Patent
- Country of Publication:
- United States
- Language:
- English
- Subject:
- 96 KNOWLEDGE MANAGEMENT AND PRESERVATION
Citation Formats
Rose, Stuart J. Automatic generation of stop word lists for information retrieval and analysis. United States: N. p., 2013. Web.
Rose, Stuart J. 2013. "Automatic generation of stop word lists for information retrieval and analysis". United States. https://www.osti.gov/servlets/purl/1082869.
@article{osti_1082869,
title = {Automatic generation of stop word lists for information retrieval and analysis},
author = {Rose, Stuart J},
abstractNote = {Methods and systems for automatically generating lists of stop words for information retrieval and analysis. Generation of the stop words can include providing a corpus of documents and a plurality of keywords. From the corpus of documents, a term list of all terms is constructed and both a keyword adjacency frequency and a keyword frequency are determined. If a ratio of the keyword adjacency frequency to the keyword frequency for a particular term on the term list is less than a predetermined value, then that term is excluded from the term list. The resulting term list is truncated based on predetermined criteria to form a stop word list.},
doi = {},
journal = {},
number = {},
volume = {},
place = {United States},
year = {2013},
month = {1},
day = {8}
}
Works referenced in this record:
Automatic keyphrase extraction from scientific documents using N-gram filtration technique
conference, January 2008
- Kumar, Niraj; Srinathan, Kannan
- Proceeding of the eighth ACM symposium on Document engineering - DocEng '08
Context-based key phrase discovery and similarity measurement utilizing search engine query logs
patent, December 2009
- Srivastava, Abhinai; Wang, Lee; Li, Ying
- US Patent Document 7,627,559
A stop list for general text
journal, September 1989
- Fox, Christopher
- ACM SIGIR Forum, Vol. 24, Issue 1-2
Method for data and text mining and literature-based discovery
patent, April 2005
- Kostoff, Ronald N.
- US Patent Document 6,886,010
Systems and methods for employing an orthogonal corpus for document indexing
patent, September 2007
- Kon, Henry B.; Burch, George W.
- US Patent Document 7,275,061
Process and system for retrieval of documents using context-relevant semantic profiles
patent, February 2001
- Roitblat, Herbert L.
- US Patent Document 6,189,002
Phrase-Based Hierarchical Clustering of Web Search Results
book, January 2003
- Masłowska, Irmina
- Advances in Information Retrieval: 25th European Conference on IR Research, ECIR 2003, Pisa, Italy, April 14-16, 2003, Proceedings
ThemeRiver: visualizing thematic changes in large document collections
journal, January 2002
- Havre, S.; Hetzler, E.; Whitney, P.
- IEEE Transactions on Visualization and Computer Graphics, Vol. 8, Issue 1
Full-Subtopic Retrieval with Keyphrase-Based Search Results Clustering
conference, September 2009
- Bernardini, Andrea; Carpineto, Claudio; D'Amico, Massimiliano
- 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology
Method and apparatus for establishing topic word classes based on an entropy cost function to retrieve documents represented by the topic words
patent, October 2000
- Wong, Wing Shenq; Qin, An-Li
- US Patent Document 6,128,613
Extraction of key phrases from document using statistical and linguistic analysis
conference, July 2009
- Raje, Satyajeet; Tulangekar, Sanket
- 2009 4th International Conference on Computer Science & Education (ICCSE)
Method and apparatus for automatically identifying keywords within a document
patent, October 2002
- Turney, Peter D.
- US Patent Document 6,470,307
Product placement engine and method
patent, March 2009
- Musgrove, Timothy A.; Walsh, Robin Hiroko
- US Patent Document 7,505,969