The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task

Vassilis Papavassiliou, Sokratis Sofianopoulos, Prokopis Prokopidis, Stelios Piperidis


Abstract
This paper describes the submission of the Institute for Language and Speech Processing/Athena Research and Innovation Center (ILSP/ARC) for the WMT 2018 Parallel Corpus Filtering shared task. We explore several properties of sentences and sentence pairs that our system explored in the context of the task with the purpose of clustering sentence pairs according to their appropriateness in training MT systems. We also discuss alternative methods for ranking the sentence pairs of the most appropriate clusters with the aim of generating the two datasets (of 10 and 100 million words as required in the task) that were evaluated. By summarizing the results of several experiments that were carried out by the organizers during the evaluation phase, our submission achieved an average BLEU score of 26.41, even though it does not make use of any language-specific resources like bilingual lexica, monolingual corpora, or MT output, while the average score of the best participant system was 27.91.
Anthology ID:
W18-6484
Volume:
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Month:
October
Year:
2018
Address:
Belgium, Brussels
Venues:
EMNLP | WMT | WS
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Note:
Pages:
928–933
Language:
URL:
https://www.aclweb.org/anthology/W18-6484
DOI:
10.18653/v1/W18-6484
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W18-6484.pdf