Incorporating Global Visual Features into Attention-based Neural Machine Translation.

Iacer Calixto, Qun Liu


Abstract
We introduce multi-modal, attention-based neural machine translation (NMT) models which incorporate visual features into different parts of both the encoder and the decoder. Global image features are extracted using a pre-trained convolutional neural network and are incorporated (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. In our experiments, we evaluate translations into English and German, how different strategies to incorporate global image features compare and which ones perform best. We also study the impact that adding synthetic multi-modal, multilingual data brings and find that the additional data have a positive impact on multi-modal NMT models. We report new state-of-the-art results and our best models also significantly improve on a comparable phrase-based Statistical MT (PBSMT) model trained on the Multi30k data set according to all metrics evaluated. To the best of our knowledge, it is the first time a purely neural model significantly improves over a PBSMT model on all metrics evaluated on this data set.
Anthology ID:
D17-1105
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Note:
Pages:
992–1003
Language:
URL:
https://www.aclweb.org/anthology/D17-1105
DOI:
10.18653/v1/D17-1105
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/D17-1105.pdf