Leveraging Local and Global Patterns for Self-Attention Networks
Mingzhou Xu | Derek F. Wong | Baosong Yang | Yue Zhang | Lidia S. Chao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Self-attention networks have received increasing research attention. By default, the hidden states of each word are hierarchically calculated by attending to all words in the sentence, which assembles global information. However, several studies pointed out that taking all signals into account may lead to overlooking neighboring information (e.g. phrase pattern). To address this argument, we propose a hybrid attention mechanism to dynamically leverage both of the local and global information. Specifically, our approach uses a gating scalar for integrating both sources of the information, which is also convenient for quantifying their contributions. Experiments on various neural machine translation tasks demonstrate the effectiveness of the proposed method. The extensive analyses verify that the two types of contexts are complementary to each other, and our method gives highly effective improvements in their integration.