Summary: Towards High Speed Grammar Induction on
Large Text Corpora
Pieter Adriaans 1,2, Marten Trautwein 1, and Marco Vervoort 2
1 Perot Systems Nederland BV, P.O. Box 2729, NL-3800 GG Amersfoort, The Netherlands
2 University of Amsterdam, FdNWI, Plantage Muidergracht 24, NL-1018 TV
Amsterdam, The Netherlands
Abstract. In this paper we describe an efficient and scalable implementation
of grammar induction based on the EMILE approach.
The current EMILE 4.1 implementation is one of the first
efficient grammar induction algorithms that work on free text. Although
EMILE 4.1 is far from perfect, it enables researchers to do empirical
grammar induction research on various types of corpora.
The EMILE approach is based on notions from categorial grammar,
which is known to generate the class of context-free languages.
EMILE learns from positive examples only. We describe
the algorithms underlying the approach and some interesting practical
results on small and large text collections. As shown in the articles men