Making Asynchronous Stochastic Gradient Descent Work for Transformers

Alham Fikri Aji, Kenneth Heafield


Abstract
Asynchronous stochastic gradient descent (SGD) converges poorly for Transformer models, so synchronous SGD has become the norm for Transformer training. This is unfortunate because asynchronous SGD is faster in raw training speed, since it avoids waiting for synchronization. Moreover, the Transformer model is the basis for state-of-the-art models on several tasks, including machine translation, so training speed matters. To understand why asynchronous SGD underperforms, we blur the lines between asynchronous and synchronous methods. We find that summing several asynchronous updates, rather than applying each immediately, restores convergence behavior. With this method, the Transformer attains the same BLEU score 1.36 times as fast.
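The core idea in the abstract can be illustrated with a minimal sketch (not the authors' code): gradients arriving asynchronously from workers are summed into a buffer, and the central parameters are updated only once every few gradients, rather than on every arrival. The function name, the scalar parameter, and the accumulation size of 4 are illustrative assumptions, not details from the paper.

```python
def sgd_accumulated(grads, lr=0.1, accum=4):
    """Apply a stream of (possibly stale) worker gradients to a scalar
    parameter, summing `accum` gradients into a buffer before each
    update instead of applying each gradient immediately.
    Illustrative sketch only."""
    theta = 0.0    # central parameter
    buffer = 0.0   # accumulated asynchronous updates
    for i, g in enumerate(grads, start=1):
        buffer += g            # sum the incoming update
        if i % accum == 0:     # apply the summed update in one step
            theta -= lr * buffer
            buffer = 0.0
    return theta
```

With eight unit gradients, a learning rate of 0.1, and accumulation over 4 gradients, two combined updates are applied, giving `sgd_accumulated([1.0] * 8) == -0.8` — the same total movement as eight immediate updates, but taken in fewer, larger steps.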
Anthology ID: D19-5608
Volume: Proceedings of the 3rd Workshop on Neural Generation and Translation
Month: November
Year: 2019
Address: Hong Kong
Venues: EMNLP | NGT | WS
Publisher: Association for Computational Linguistics
Pages: 80–89
URL: https://www.aclweb.org/anthology/D19-5608
DOI: 10.18653/v1/D19-5608
PDF: http://aclanthology.lst.uni-saarland.de/D19-5608.pdf