Summary: Balkan South-East Corpora Aligned to English
Institute for Parallel Processing, Bulgarian Academy of Sciences
25A Acad. G.Bonchev St., Sofia 1113, Bulgaria
The paper describes a new corpus of nine Balkan South-East
languages (BSEC) aligned to English, with a volume of 3 million
words for each language. A new tool for batch alignment has
been created, and some functional features of the different
aligners constructed and used at IPP, BAS, are discussed.
The general principles of the morphosyntactic annotations (used
in INTERA, INTEX and Multext-East projects) available for
the majority of BSEC languages are investigated for possible
harmonization and future unification. The paper examines the
principles and the main issues of the transition between the
INTEX and MTE formats for Bulgarian and Russian.
Keywords: aligned corpora, aligners, morphosyntactic
annotations, tagset standards.
1. Aligned corpora beyond "old"