Alignment-based profiling of Europarl data in an English-Swedish parallel corpus
Lars Ahrenberg
Department of Computer and Information Science
Linköpings universitet
SE-58183, Sweden
lah@ida.liu.se
Abstract
This paper profiles the Europarl part of an English-Swedish parallel corpus and compares it with three other subcorpora of the same
parallel corpus. We first describe our method for comparison, which is based on alignments at both the token level and the structural
level. Although two of the other subcorpora contain fiction, the Europarl part is found to have the highest proportion of
many types of restructurings, including additions, deletions and long distance reorderings. We explain this by the fact that the majority
of Europarl segments are parallel translations.
1. Introduction
The Europarl corpus (Koehn, 2005) is the most widely used
corpus for training and evaluating statistical machine
translation systems for European languages, as evidenced by
several recent workshops on the topic. The reasons are not
hard to understand: it is very large, it is freely available, and
it has data for all pairs of EU languages.

  

Source: Ahrenberg, Lars - Department of Computer and Information Science, Linköpings Universitet

 

Collections: Computer Technologies and Information Sciences