TALN/LS2N Participation at the BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora

Martin Laville, Amir Hazem, Emmanuel Morin


Abstract
This paper describes the TALN/LS2N system participation at the Building and Using Comparable Corpora (BUCC) shared task. We first introduce three strategies: (i) a word embedding approach based on fastText embeddings; (ii) a concatenation approach using both character Skip-Gram and character CBOW models, and finally (iii) a cognates matching approach based on an exact match string similarity. Then, we present the applied strategy for the shared task which consists in the combination of the embeddings concatenation and the cognates matching approaches. The covered languages are French, English, German, Russian and Spanish. Overall, our system mixing embeddings concatenation and perfect cognates matching obtained the best results while compared to individual strategies, except for English-Russian and Russian-English language pairs for which the concatenation approach was preferred.
Anthology ID:
2020.bucc-1.9
Volume:
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
BUCC | LREC | WS
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
56–60
Language:
English
URL:
https://www.aclweb.org/anthology/2020.bucc-1.9
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.bucc-1.9.pdf