Tilde’s Parallel Corpus Filtering Methods for WMT 2018

Mārcis Pinnis


Abstract
The paper describes parallel corpus filtering methods that allow reducing noise of noisy “parallel” corpora from a level where the corpora are not usable for neural machine translation training (i.e., the resulting systems fail to achieve reasonable translation quality; well below 10 BLEU points) up to a level where the trained systems show decent (over 20 BLEU points on a 10 million word dataset and up to 30 BLEU points on a 100 million word dataset). The paper also documents Tilde’s submissions to the WMT 2018 shared task on parallel corpus filtering.
Anthology ID:
W18-6486
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Venues:
EMNLP | WMT | WS
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Note:
Pages:
939–945
Language:
URL:
https://www.aclweb.org/anthology/W18-6486
DOI:
10.18653/v1/W18-6486
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W18-6486.pdf