Summary: Generating grammars for SGML tagged
texts lacking DTD
Helena Ahonen
University of Helsinki
Heikki Mannila
University of Helsinki
Erja Nikunen
Research Centre for Domestic Languages
Abstract
We describe a technique for forming a context-free grammar for a
document that has some kind of tagging (structural or typographical)
but for which no concise description of the structure is available.
The technique is based on ideas from machine learning. It first forms
a set of finite-state automata that describe the document completely.
These automata are then modified by considering certain context
conditions; the modifications correspond to generalizing the
underlying languages. Finally, the automata are converted into
regular expressions, which are then used to construct the grammar. An
alternative representation, characteristic k-grams, is also
introduced. Additionally, the paper describes some interactive
operations necessary for generating a gram