
Alignment-based profiling of Europarl data in an English-Swedish parallel
Lars Ahrenberg

Summary:
Lars Ahrenberg
Department of Computer and Information Science
Linköpings universitet
SE-58183, Sweden
This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same
parallel corpus. We first describe our method for comparison, which is based on alignments at both the token level and the structural
level. Although two of the other subcorpora contain fiction, the Europarl part is found to have the highest proportion of
many types of restructurings, including additions, deletions and long-distance reorderings. We explain this by the fact that the majority
of Europarl segments are parallel translations.
1. Introduction
The Europarl corpus (Koehn, 2005) is the most widely used
corpus for training and evaluating statistical machine translation
systems for European languages, as evidenced by
several recent workshops on the topic. The reasons are not
hard to understand; it is very large, it is freely available, and
it has data for all pairs of EU languages.


Source: Ahrenberg, Lars - Department of Computer and Information Science, Linköpings Universitet


Collections: Computer Technologies and Information Sciences