Discovery of Frequent Word Sequences in Text, Helena Ahonen-Myka

Summary: Discovery of Frequent Word Sequences in Text
Helena Ahonen-Myka
University of Helsinki
Department of Computer Science
P.O.Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki, Finland
Abstract. We have developed a method that extracts all maximal
frequent word sequences from the documents of a collection. A sequence is
said to be frequent if it appears in more than σ documents, where σ
is the given frequency threshold. Furthermore, a sequence is maximal if
no other frequent sequence exists that contains this sequence. The words
of a sequence do not have to appear consecutively in the text.
In this paper, we briefly describe the method for finding all maximal
frequent word sequences in text and then extend the method to extract
generalized sequences from annotated texts, where each word has a set
of additional, e.g. morphological, features attached to it. We aim at
discovering patterns which preserve as many features as possible such that
the frequency of the pattern still exceeds the given frequency threshold.
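
The two definitions in the abstract (frequent: appears in more than σ documents; maximal: contained in no other frequent sequence) can be illustrated with a small brute-force sketch. This is not the method developed in the paper, only a direct reading of the definitions. The function names, the toy documents, and the max_len cap on sequence length are assumptions made for illustration, and "maximal" here is only relative to sequences of at most max_len words.

```python
from itertools import combinations


def subsequences(words, max_len):
    """Distinct word subsequences (gaps allowed) of length 1..max_len."""
    found = set()
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(words)), n):
            found.add(tuple(words[i] for i in idx))
    return found


def is_subsequence(short, long):
    """True if `short` occurs in `long` in order, not necessarily consecutively."""
    it = iter(long)
    return all(word in it for word in short)


def maximal_frequent_sequences(documents, sigma, max_len=4):
    """Brute-force illustration; max_len is an assumed cap, not from the paper."""
    counts = {}
    for doc in documents:
        # Each document contributes at most once to a sequence's frequency.
        for seq in subsequences(doc.split(), max_len):
            counts[seq] = counts.get(seq, 0) + 1
    # Frequent: appears in MORE than sigma documents (strict inequality).
    frequent = [s for s, c in counts.items() if c > sigma]
    # Maximal: no other frequent sequence contains it as a subsequence.
    return {
        s for s in frequent
        if not any(s != t and is_subsequence(s, t) for t in frequent)
    }


# Toy collection (hypothetical data).
docs = [
    "data mining finds frequent patterns",
    "data mining of text finds patterns",
    "text mining finds rules",
]
result = maximal_frequent_sequences(docs, sigma=1)
print(sorted(result))
```

With σ = 1, a sequence must occur in at least two of the three toy documents. The sketch should report the sequences (data, mining, finds, patterns) and (text, finds). Shorter sequences such as (mining, finds) are frequent but not maximal, because they are contained in a longer frequent sequence. The enumeration grows exponentially with sequence length, so it is suitable only for tiny examples.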


Source: Ahonen, Helena - Department of Computer Science, University of Helsinki


Collections: Computer Technologies and Information Sciences