Summary: Discovery of Frequent Word Sequences in Text
Helena Ahonen-Myka
University of Helsinki
Department of Computer Science
P.O.Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland
helena.ahonenmyka@cs.helsinki.fi
Abstract. We have developed a method that extracts all maximal frequent
word sequences from the documents of a collection. A sequence is said
to be frequent if it appears in more than σ documents, where σ is a
given frequency threshold. Furthermore, a sequence is maximal if no
other frequent sequence exists that contains it. The words of a
sequence do not have to appear consecutively in the text.
In this paper, we briefly describe the method for finding all maximal
frequent word sequences in text and then extend it to extract
generalized sequences from annotated texts, where each word has a set
of additional (e.g. morphological) features attached to it. We aim to
discover patterns that preserve as many features as possible while the
frequency of the pattern still exceeds the given frequency threshold.
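The two definitions in the abstract can be illustrated with a small brute-force sketch. It is not the method of the paper, only a direct check of the definitions on a toy collection. All names (is_subsequence, document_frequency, frequent_sequences, maximal_sequences, max_len), the example documents, and the reading of "contains" as subsequence containment are assumptions made for illustration; the frequency threshold is called sigma in the code.

```python
from itertools import combinations

def is_subsequence(pattern, words):
    """True if the words of pattern occur in words in order, gaps allowed."""
    it = iter(words)
    # 'w in it' advances the iterator, so order is respected.
    return all(w in it for w in pattern)

def document_frequency(pattern, docs):
    """Number of documents that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, d) for d in docs)

def frequent_sequences(docs, sigma, max_len=4):
    """Brute force: enumerate every ordered word selection of up to
    max_len words (an assumed cap, to keep the toy example small) and
    keep those appearing in MORE than sigma documents."""
    candidates = set()
    for d in docs:
        for n in range(1, min(max_len, len(d)) + 1):
            for idx in combinations(range(len(d)), n):
                candidates.add(tuple(d[i] for i in idx))
    return {c for c in candidates if document_frequency(c, docs) > sigma}

def maximal_sequences(frequent):
    """Keep sequences not contained in any other frequent sequence."""
    return {s for s in frequent
            if not any(s != t and is_subsequence(s, t) for t in frequent)}

# Hypothetical toy collection, one tokenized document per entry.
docs = [
    "the frequency threshold is given by the user".split(),
    "a frequency threshold must be given".split(),
    "the threshold is fixed".split(),
]
result = maximal_sequences(frequent_sequences(docs, sigma=1))
print(sorted(result))
# [('frequency', 'threshold', 'given'), ('the', 'threshold', 'is')]
```

With sigma = 1, a sequence must occur in at least two documents. Shorter frequent sequences such as ('frequency', 'threshold') are dropped because they are contained in a longer frequent one. The exhaustive enumeration grows exponentially with document length, so it is usable only for tiny inputs.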
1 Introduction