Summary: Alignment-based profiling of Europarl data in an English-Swedish parallel
Department of Computer and Information Science
This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same
parallel corpus. We first describe our method for comparison, which is based on alignments at both the token level and the structural
level. Although two of the other subcorpora contain fiction, the Europarl part turns out to have the highest proportion of
many types of restructurings, including additions, deletions, and long-distance reorderings. We explain this by the fact that the majority
of Europarl segments are parallel translations.
The Europarl corpus (Koehn, 2005) is the most widely used
corpus for training and evaluating statistical machine translation
systems for European languages, as evidenced by
several recent workshops on the topic. The reasons are not
hard to understand: it is very large, it is freely available, and
it has data for all pairs of EU languages.