Scheduled DropHead: A Regularization Method for Transformer Models

Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou, Ke Xu


Abstract
We introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, a key component of transformer models. In contrast to conventional dropout, which randomly drops individual units or connections, DropHead drops entire attention heads during training, preventing the multi-head attention model from being dominated by a small subset of attention heads. This reduces the risk of overfitting and allows the model to benefit more fully from multi-head attention. Given the interaction between multi-headedness and training dynamics, we further propose a novel dropout rate scheduler that adjusts the DropHead rate throughout training, yielding a stronger regularization effect. Experimental results demonstrate that our approach improves transformer models by 0.9 BLEU on the WMT14 En-De translation task and by around 1.0 accuracy point on various text classification tasks.
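
The abstract describes head-level structured dropout with a scheduled rate. Below is a minimal PyTorch-style sketch of the idea, not the authors' released code: the tensor layout, the rescaling of surviving heads, and the V-shaped schedule are illustrative assumptions.

import torch

def drop_head(attn_out: torch.Tensor, p: float, training: bool = True) -> torch.Tensor:
    """Zero out entire attention heads with probability p.

    attn_out: per-head attention output of shape [batch, n_heads, seq_len, head_dim]
    (an assumed layout). Surviving heads are rescaled so the expected output
    magnitude is unchanged, mirroring inverted dropout.
    """
    if not training or p == 0.0:
        return attn_out
    batch, n_heads = attn_out.shape[:2]
    # Sample one keep/drop decision per head per example.
    keep = (torch.rand(batch, n_heads, 1, 1, device=attn_out.device) >= p).to(attn_out.dtype)
    # Rescale by the fraction of surviving heads; guard against the rare
    # case where every head in an example is dropped.
    n_kept = keep.sum(dim=1, keepdim=True).clamp(min=1.0)
    return attn_out * keep * (n_heads / n_kept)

def scheduled_rate(step: int, total_steps: int, p_max: float = 0.2) -> float:
    """Illustrative V-shaped schedule (an assumption, not the paper's exact
    curve): start at p_max, anneal to 0 at the training midpoint, then rise
    back to p_max."""
    mid = total_steps / 2
    return p_max * abs(step - mid) / mid

In practice the head mask would be applied inside the attention module, to the per-head outputs before the output projection, with the rate queried from the scheduler at each training step; drop_head is shown here as a standalone function for clarity.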
Anthology ID:
2020.findings-emnlp.178
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | Findings
Publisher:
Association for Computational Linguistics
Pages:
1971–1980
URL:
https://www.aclweb.org/anthology/2020.findings-emnlp.178
DOI:
10.18653/v1/2020.findings-emnlp.178
PDF:
http://aclanthology.lst.uni-saarland.de/2020.findings-emnlp.178.pdf