Scheduled DropHead: A Regularization Method for Transformer Models
Wangchunshu Zhou, Tao Ge, Furu Wei, Ming Zhou, Ke Xu
Abstract
We introduce DropHead, a structured dropout method specifically designed for regularizing the multi-head attention mechanism, a key component of the Transformer architecture. In contrast to conventional dropout, which randomly drops individual units or connections, DropHead drops entire attention heads during training to prevent the model from being dominated by a small subset of attention heads. This reduces the risk of overfitting and allows the model to better benefit from multi-head attention. Given the interaction between multi-headedness and training dynamics, we further propose a novel dropout rate scheduler that adjusts the DropHead rate throughout training, yielding a stronger regularization effect. Experimental results demonstrate that our proposed approach improves Transformer models by 0.9 BLEU on the WMT14 En-De translation task and by around 1.0 accuracy points on various text classification tasks.
- Anthology ID:
- 2020.findings-emnlp.178
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Venues:
- EMNLP | Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1971–1980
- URL:
- https://www.aclweb.org/anthology/2020.findings-emnlp.178
- DOI:
- 10.18653/v1/2020.findings-emnlp.178
- PDF:
- http://aclanthology.lst.uni-saarland.de/2020.findings-emnlp.178.pdf
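The abstract describes two ingredients: dropping whole attention heads during training and scheduling the drop rate over the course of training. The sketch below illustrates the head-level dropout idea in PyTorch; the function names (`drop_head`, `linear_drophead_rate`), the rescaling choice, and the plain linear ramp are illustrative assumptions, not the authors' released implementation or their actual schedule, which should be taken from the paper itself.

```python
# Minimal sketch of head-level dropout over multi-head attention outputs,
# plus a simple (assumed) linear schedule for the drop rate.
import torch


def drop_head(attn_output: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Randomly zero out entire attention heads.

    attn_output: (batch, num_heads, seq_len, head_dim) per-head attention outputs
    p: probability of dropping each head
    """
    if not training or p == 0.0:
        return attn_output
    batch, num_heads = attn_output.shape[:2]
    # One Bernoulli draw per (example, head); 1 = keep, 0 = drop.
    keep = (torch.rand(batch, num_heads, device=attn_output.device) > p).float()
    # Guard against dropping every head of an example: force the first head back on.
    all_dropped = keep.sum(dim=1, keepdim=True) == 0
    keep[:, :1] = torch.where(all_dropped, torch.ones_like(keep[:, :1]), keep[:, :1])
    # Rescale surviving heads so the expected output magnitude is preserved.
    scale = num_heads / keep.sum(dim=1, keepdim=True)
    mask = (keep * scale).view(batch, num_heads, 1, 1)
    return attn_output * mask


def linear_drophead_rate(step: int, total_steps: int, max_p: float = 0.2) -> float:
    """Illustrative schedule only: ramp the DropHead rate linearly from 0 to max_p."""
    return max_p * min(step, total_steps) / total_steps
```

In use, one would compute `p = linear_drophead_rate(step, total_steps)` at each training step and apply `drop_head` to the per-head outputs before they are concatenated and passed through the output projection.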