Experiments on Processing Overlapping Parallel Corpora

Mark Fishel, Heiki-Jaan Kaalep


Abstract
The number and sizes of parallel corpora keep growing, which makes it necessary to have automatic methods of processing them: combining, checking and improving corpora quality, etc. We here introduce a method which enables performing many of these by exploiting overlapping parallel corpora. The method finds the correspondence between sentence pairs in two corpora: first the corresponding language parts of the corpora are aligned and then the two resulting alignments are compared. The method takes into consideration slight differences in the source documents, different levels of segmentation of the input corpora, encoding differences and other aspects of the task. The paper describes two experiments conducted to test the method. In the first experiment, the Estonian-English part of the JRC-Acquis corpus was combined with another corpus of legislation texts. In the second experiment alternatively aligned versions of the JRC-Acquis are compared to each other with the example of all language pairs between English, Estonian and Latvian. Several additional conclusions about the corpora can be drawn from the results. The method proves to be effective for several parallel corpora processing tasks.
Anthology ID:
L08-1114
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/776_paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/776_paper.pdf