Graph-based Filtering of Out-of-Vocabulary Words for Encoder-Decoder Models

Satoru Katsumata, Yukio Matsumura, Hayahide Yamagishi, Mamoru Komachi


Abstract
Encoder-decoder models typically employ only words that occur frequently in the training corpus, both to limit computational cost and to exclude noisy words. However, this vocabulary set may still include words that interfere with learning in encoder-decoder models. This paper proposes a method for selecting words better suited to learning encoders by utilizing not only frequency but also co-occurrence information, which we capture using the HITS algorithm. The proposed method is applied to two tasks: machine translation and grammatical error correction. For Japanese-to-English translation, the method achieved a BLEU score 0.56 points higher than that of a baseline. It also outperformed the baseline on English grammatical error correction, with an F-measure 1.48 points higher.
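The abstract's key ingredient is the HITS algorithm, which scores graph nodes as hubs and authorities by mutual reinforcement. The sketch below is a minimal, generic HITS implementation; the graph here is a hypothetical word co-occurrence graph, since the abstract does not specify how the paper constructs its edges. All variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def hits(adj, iters=50):
    """Standard HITS power iteration on a directed graph.

    adj: n x n adjacency matrix; adj[i, j] = 1 means node i links to
    node j (hypothetically, word i co-occurs with / points to word j).
    Returns (hub, authority) score vectors, each L2-normalized.
    """
    n = adj.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        # A good authority is pointed to by good hubs.
        auth = adj.T @ hub
        auth /= np.linalg.norm(auth)
        # A good hub points to good authorities.
        hub = adj @ auth
        hub /= np.linalg.norm(hub)
    return hub, auth

# Toy 3-node graph: 0 -> 1, 0 -> 2, 1 -> 2.
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)
hub, auth = hits(adj)
```

On this toy graph, node 2 receives the highest authority score (everything points to it) and node 0 the highest hub score (it points to everything). Under the paper's stated idea, authority-style scores over a co-occurrence graph would then supplement raw frequency when choosing the vocabulary.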
Anthology ID:
P18-3016
Volume:
Proceedings of ACL 2018, Student Research Workshop
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
112–119
URL:
https://www.aclweb.org/anthology/P18-3016
DOI:
10.18653/v1/P18-3016
PDF:
http://aclanthology.lst.uni-saarland.de/P18-3016.pdf