Improving Low-Resource Neural Machine Translation with Filtered Pseudo-Parallel Corpus

Aizhan Imankulova, Takayuki Sato, Mamoru Komachi


Abstract
Large-scale parallel corpora are indispensable for training highly accurate machine translation systems. However, manually constructed large-scale parallel corpora are not freely available for many language pairs. In previous studies, training data have been expanded using a pseudo-parallel corpus obtained by machine-translating a monolingual corpus in the target language. However, for low-resource language pairs in which only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method to expand the training data effectively by filtering the pseudo-parallel corpus using quality estimation based on back-translation. In experiments with three language pairs using small, medium, and large parallel corpora, language pairs with less training data filtered out more sentence pairs and showed larger BLEU score improvements.
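The filtering idea described above can be sketched as follows: each target-language sentence is back-translated into a pseudo source sentence, the pseudo source is translated forward again, and the pair is kept only if the round trip scores above a threshold under sentence-level BLEU. This is a minimal illustrative sketch, not the paper's exact setup; the toy translator callables, the add-one smoothing, and the threshold value are assumptions made for the example.

```python
from collections import Counter
import math


def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-gram precisions
    (illustrative; the paper's exact scoring function may differ)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        log_prec_sum += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return brevity * math.exp(log_prec_sum / max_n)


def build_filtered_corpus(target_sents, tgt2src, src2tgt, threshold=0.3):
    """Build a pseudo-parallel corpus via back-translation and keep only
    pairs whose round-trip translation scores >= threshold.

    tgt2src / src2tgt are stand-ins for trained MT systems; the threshold
    0.3 is a hypothetical value for illustration."""
    kept = []
    for t in target_sents:
        pseudo_src = tgt2src(t)           # back-translate target -> pseudo source
        round_trip = src2tgt(pseudo_src)  # forward-translate pseudo source -> target
        if sentence_bleu(t, round_trip) >= threshold:
            kept.append((pseudo_src, t))  # (pseudo source, original target)
    return kept
```

With an accurate round trip the pair survives; a degenerate round trip is filtered out, which is how the method discards low-quality pseudo-parallel pairs.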
Anthology ID:
W17-5704
Volume:
Proceedings of the 4th Workshop on Asian Translation (WAT2017)
Month:
November
Year:
2017
Address:
Taipei, Taiwan
Venues:
WAT | WS
Publisher:
Asian Federation of Natural Language Processing
Pages:
70–78
URL:
https://www.aclweb.org/anthology/W17-5704
PDF:
http://aclanthology.lst.uni-saarland.de/W17-5704.pdf