Training Deeper Neural Machine Translation Models with Transparent Attention

Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, Yonghui Wu


Abstract
While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we attempt to train significantly (2-3x) deeper Transformer and Bi-RNN encoders for machine translation. We propose a simple modification to the attention mechanism that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT’14 English-German and WMT’15 Czech-English tasks for both architectures.
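The abstract describes the modification only at a high level; going by the title, the idea is a "transparent" attention in which each decoder block attends over a learned, softmax-weighted combination of all encoder layer outputs rather than only the top layer, which shortens gradient paths through deep encoders. The following is a minimal NumPy sketch of that reading, not the paper's actual implementation; the function name `transparent_combine`, layer count, and dimensions are invented for illustration.

```python
import numpy as np

def transparent_combine(encoder_layers, mix_logits):
    """Softmax-weighted combination of all encoder layer outputs.

    encoder_layers: list of M arrays, each [seq_len, d_model]
        (e.g. the embedding output plus every encoder layer output).
    mix_logits: array [M] of learned scalar mixing weights for one
        decoder block (assumption: one weight vector per decoder block).
    Returns: [seq_len, d_model] source representation for this decoder
        block to attend over, instead of only the top encoder layer.
    """
    weights = np.exp(mix_logits - mix_logits.max())
    weights /= weights.sum()                     # softmax over encoder depth
    stacked = np.stack(encoder_layers, axis=0)   # [M, seq_len, d_model]
    return np.einsum("m,msd->sd", weights, stacked)

# Toy usage: embeddings plus 6 encoder layers, one decoder block's weights.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(10, 16)) for _ in range(7)]
mix_logits = rng.normal(size=7)                  # trained jointly in practice
source_repr = transparent_combine(layers, mix_logits)
print(source_repr.shape)                         # (10, 16)
```

In this reading, the mixing weights act like learned weighted residual connections along encoder depth, so gradients reach lower encoder layers directly even when the encoder is 2-3x deeper.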
Anthology ID: D18-1338
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month: October-November
Year: 2018
Address: Brussels, Belgium
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 3028–3033
URL: https://www.aclweb.org/anthology/D18-1338
DOI: 10.18653/v1/D18-1338
PDF: http://aclanthology.lst.uni-saarland.de/D18-1338.pdf