Baosong Yang


2020

Uncertainty-Aware Curriculum Learning for Neural Machine Translation
Yikai Zhou | Baosong Yang | Derek F. Wong | Yu Wan | Lidia S. Chao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural machine translation (NMT) has been shown to benefit from curriculum learning, which presents examples in an easy-to-hard order across training stages. The keys lie in the assessment of data difficulty and model competence. We propose uncertainty-aware curriculum learning, motivated by two intuitions: 1) the higher the uncertainty in a translation pair, the more complex and rarer the information it contains; and 2) the end of the decline in model uncertainty indicates the completion of the current training stage. Specifically, we use the cross-entropy of an example as its data difficulty and exploit the variance of distributions over the weights of the network to represent model uncertainty. Extensive experiments on various translation tasks reveal that our approach outperforms the strong baseline and related methods in both translation quality and convergence speed. Quantitative analyses reveal that the proposed strategy offers NMT the ability to automatically govern its learning schedule.
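
As a rough illustration of the two measures above, the sketch below scores data difficulty as sentence-level cross-entropy under a pre-trained model and approximates model uncertainty via the variance of that loss under Monte Carlo dropout; the `scorer` interface and helper names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def data_difficulty(scorer, src, tgt):
    """Sentence-level cross-entropy of a translation pair (higher = harder).
    Assumes a hypothetical `scorer(src, tgt)` returning (tgt_len, vocab) logits."""
    scorer.eval()
    with torch.no_grad():
        logits = scorer(src, tgt)
        return F.cross_entropy(logits, tgt, reduction="sum").item()

def model_uncertainty(scorer, src, tgt, n_passes=8):
    """Variance of the loss over stochastic forward passes (Monte Carlo dropout),
    used here as a proxy for uncertainty over the network weights."""
    scorer.train()  # keep dropout active
    losses = []
    with torch.no_grad():
        for _ in range(n_passes):
            losses.append(F.cross_entropy(scorer(src, tgt), tgt, reduction="sum"))
    return torch.stack(losses).var().item()

def build_curriculum(scorer, pairs):
    """Order the corpus from easy to hard by data difficulty."""
    return sorted(pairs, key=lambda p: data_difficulty(scorer, *p))
```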

Domain Transfer based Data Augmentation for Neural Query Translation
Liang Yao | Baosong Yang | Haibo Zhang | Boxing Chen | Weihua Luo
Proceedings of the 28th International Conference on Computational Linguistics

Query translation (QT) is a critical factor in successful cross-lingual information retrieval (CLIR). Due to the lack of parallel query samples, neural QT models are usually optimized with synthetic data derived from large-scale monolingual queries. Nevertheless, such a pseudo corpus is mostly produced by a general-domain translation model, making it insufficient to guide the learning of the QT model. In this paper, we extend data augmentation with a domain transfer procedure that revises synthetic candidates into search-aware examples. Specifically, the domain transfer model is built upon the Transformer, in which layer coordination and mixed attention are exploited to speed up the refining process and to leverage parameters from a pre-trained cross-lingual language model. To examine the effectiveness of the proposed method, we collected French-to-English and Spanish-to-English QT test sets, each of which consists of 10,000 carefully manually checked parallel query pairs. Qualitative and quantitative analyses reveal that our model significantly outperforms strong baselines and related domain transfer methods in both translation quality and retrieval accuracy.
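
The augmentation pipeline described above can be sketched at a high level as follows; `general_mt` and `domain_transfer` are hypothetical stand-ins for the general-domain translation model and the domain transfer (refinement) model, and the translation direction is abstracted away.

```python
def augment_queries(monolingual_queries, general_mt, domain_transfer):
    """Build search-aware synthetic training pairs for query translation.

    general_mt(query)      -> draft translation from a general-domain NMT model
    domain_transfer(draft) -> the draft revised toward the search domain
    (both callables are hypothetical stand-ins for trained models)
    """
    synthetic_pairs = []
    for query in monolingual_queries:
        draft = general_mt(query)
        revised = domain_transfer(draft)
        synthetic_pairs.append((query, revised))
    return synthetic_pairs
```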

Self-Paced Learning for Neural Machine Translation
Yu Wan | Baosong Yang | Derek F. Wong | Yikai Zhou | Lidia S. Chao | Haibo Zhang | Boxing Chen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recent studies have shown that the training of neural machine translation (NMT) can be facilitated by mimicking the learning process of humans. Nevertheless, the gains from such curriculum learning depend on the quality of an artificial schedule drawn up with handcrafted features, e.g., sentence length or word rarity. We replace this procedure with a more flexible one by proposing self-paced learning, in which the NMT model is allowed to 1) automatically quantify its learning confidence over training examples; and 2) flexibly govern its learning by regulating the loss at each iteration. Experimental results on multiple translation tasks demonstrate that the proposed model outperforms strong baselines and models trained with human-designed curricula in both translation quality and convergence speed.
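
A minimal sketch of such confidence-regulated training, assuming per-sentence losses from the NMT model and a confidence estimate derived from their variance over dropout passes; the exponential weighting below is illustrative rather than the paper's exact formulation.

```python
import torch

def confidence_weighted_loss(per_sentence_loss, per_sentence_variance):
    """Down-weight examples the model is currently uncertain about.

    per_sentence_loss:     (batch,) token-averaged NMT loss per sentence
    per_sentence_variance: (batch,) variance of that loss over dropout passes
    """
    confidence = torch.exp(-per_sentence_variance)  # in (0, 1], high when stable
    return (confidence.detach() * per_sentence_loss).mean()

# Illustrative usage inside a training step:
loss = confidence_weighted_loss(torch.tensor([2.1, 3.7, 1.4], requires_grad=True),
                                torch.tensor([0.05, 0.90, 0.10]))
loss.backward()
```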

2019

Leveraging Local and Global Patterns for Self-Attention Networks
Mingzhou Xu | Derek F. Wong | Baosong Yang | Yue Zhang | Lidia S. Chao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Self-attention networks have received increasing research attention. By default, the hidden state of each word is hierarchically calculated by attending to all words in the sentence, which assembles global information. However, several studies have pointed out that taking all signals into account may lead to overlooking neighboring information (e.g., phrase patterns). To address this issue, we propose a hybrid attention mechanism that dynamically leverages both local and global information. Specifically, our approach uses a gating scalar to integrate the two sources of information, which also makes it convenient to quantify their contributions. Experiments on various neural machine translation tasks demonstrate the effectiveness of the proposed method. Extensive analyses verify that the two types of context are complementary to each other, and that our method integrates them highly effectively.
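
The gating described above can be sketched as follows, where `local_context` and `global_context` stand in for the outputs of windowed and full self-attention; the gate parameterization is an assumption for illustration.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse local and global attention outputs with a learned gating scalar."""
    def __init__(self, d_model):
        super().__init__()
        self.gate_proj = nn.Linear(2 * d_model, 1)

    def forward(self, local_context, global_context):
        # g in (0, 1) quantifies the contribution of the local context.
        g = torch.sigmoid(self.gate_proj(
            torch.cat([local_context, global_context], dim=-1)))
        return g * local_context + (1.0 - g) * global_context

fusion = GatedFusion(d_model=512)
out = fusion(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```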

Assessing the Ability of Self-Attention Networks to Learn Word Order
Baosong Yang | Longyue Wang | Derek F. Wong | Lidia S. Chao | Zhaopeng Tu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Self-attention networks (SANs) have attracted considerable interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g., machine translation. Because they lack the recurrence structure of recurrent neural networks (RNNs), SANs are often assumed to be weak at learning the positional information of words for sequence modeling. However, this speculation has neither been empirically confirmed, nor has it been explained why SANs perform so strongly on machine translation despite the supposed lack of positional information. To this end, we propose a novel word reordering detection task to quantify how well word order information is learned by SANs and RNNs. Specifically, we randomly move one word to another position and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SANs trained on word reordering detection indeed have difficulty learning positional information, even with position embeddings; and 2) SANs trained on machine translation learn positional information better than their RNN counterparts, and position embeddings play a critical role in this. Although the recurrence structure makes a model more universally effective at learning word order, the learning objective matters more in downstream tasks such as machine translation.
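
The probing data can be constructed with a few lines of code: pick a random word, move it elsewhere, and keep both positions as labels. The sketch below is a simplified illustration of the task setup, not the paper's exact preprocessing.

```python
import random

def make_reordering_example(tokens):
    """Move one random word to a new position; return the perturbed sentence
    together with the (original, inserted) position labels."""
    assert len(tokens) > 1
    src_pos = random.randrange(len(tokens))
    word = tokens[src_pos]
    rest = tokens[:src_pos] + tokens[src_pos + 1:]
    tgt_pos = random.randrange(len(rest) + 1)
    while tgt_pos == src_pos:              # force an actual move
        tgt_pos = random.randrange(len(rest) + 1)
    perturbed = rest[:tgt_pos] + [word] + rest[tgt_pos:]
    return perturbed, src_pos, tgt_pos

sentence = "the quick brown fox jumps over the lazy dog".split()
print(make_reordering_example(sentence))
```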

Modeling Recurrence for Transformer
Jie Hao | Xing Wang | Baosong Yang | Longyue Wang | Jinfeng Zhang | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recently, the Transformer model, which is based solely on attention mechanisms, has advanced the state of the art on various machine translation tasks. However, recent studies reveal that the lack of recurrence modeling hinders further improvement in translation capacity. In response to this problem, we propose to directly model recurrence for the Transformer with an additional recurrence encoder. In addition to the standard recurrent neural network, we introduce a novel attentive recurrent network to leverage the strengths of both attention models and recurrent networks. Experimental results on the widely used WMT14 English⇒German and WMT17 Chinese⇒English translation tasks demonstrate the effectiveness of the proposed approach. Our studies also reveal that the proposed model benefits from a short-cut that bridges the source and target sequences with a single recurrent layer, which outperforms its deep counterpart.
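
A minimal sketch of the "short-cut" variant mentioned above: a single recurrent layer runs over the source embeddings and serves as an additional memory for the decoder. How the two encoders' outputs are combined is left open here, and the module below is an illustrative simplification.

```python
import torch
import torch.nn as nn

class ShortcutRecurrenceEncoder(nn.Module):
    """A single recurrent layer over the source embeddings, used as an
    additional encoder alongside the standard Transformer encoder."""
    def __init__(self, d_model):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model, num_layers=1, batch_first=True)

    def forward(self, src_embeddings):
        states, _ = self.rnn(src_embeddings)   # (batch, src_len, d_model)
        return states

rec_enc = ShortcutRecurrenceEncoder(d_model=512)
recurrent_memory = rec_enc(torch.randn(2, 10, 512))
# The decoder would attend to both the Transformer encoder states and
# this recurrent memory.
```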

Information Aggregation for Multi-Head Attention with Routing-by-Agreement
Jian Li | Baosong Yang | Zi-Yi Dou | Xing Wang | Michael R. Lyu | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Multi-head attention is appealing for its ability to jointly extract different types of information from multiple representation subspaces. For information aggregation, a common practice is concatenation followed by a linear transformation, which may not fully exploit the expressiveness of multi-head attention. In this work, we propose to improve information aggregation for multi-head attention with a more powerful routing-by-agreement algorithm. Specifically, the routing algorithm iteratively updates the proportion of how much a part (i.e., the distinct information learned from a specific subspace) should be assigned to a whole (i.e., the final output representation), based on the agreement between parts and wholes. Experimental results on linguistic probing tasks and machine translation tasks demonstrate the superiority of the advanced information aggregation over the standard linear transformation.
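
The iterative update can be sketched with the generic routing-by-agreement recipe below, which treats each head output as a "part" and the aggregated vector as the "whole"; it omits the per-part transformations and nonlinearities of full capsule routing, so it is a simplification rather than the paper's exact algorithm.

```python
import torch

def route_heads(head_outputs, n_iterations=3):
    """head_outputs: (n_heads, batch, d_head) -> aggregated (batch, d_head).

    Iteratively re-estimates how much each head ("part") contributes to the
    output ("whole"), based on the agreement between parts and the whole.
    """
    n_heads, batch, _ = head_outputs.shape
    logits = torch.zeros(n_heads, batch, 1)                    # routing logits b
    for _ in range(n_iterations):
        weights = torch.softmax(logits, dim=0)                 # c = softmax(b)
        whole = (weights * head_outputs).sum(dim=0)            # weighted sum of parts
        agreement = (head_outputs * whole.unsqueeze(0)).sum(-1, keepdim=True)
        logits = logits + agreement                            # reinforce agreement
    return whole

aggregated = route_heads(torch.randn(8, 2, 64))
```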

Convolutional Self-Attention Networks
Baosong Yang | Longyue Wang | Derek F. Wong | Lidia S. Chao | Zhaopeng Tu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Self-attention networks (SANs) have drawn increasing interest due to their high parallelization in computation and flexibility in modeling dependencies. SANs can be further enhanced with multi-head attention by allowing the model to attend to information from different representation subspaces. In this work, we propose novel convolutional self-attention networks, which offer SANs the ability to 1) strengthen dependencies among neighboring elements, and 2) model the interaction between features extracted by multiple attention heads. Experimental results on machine translation across different language pairs and model settings show that our approach outperforms both the strong Transformer baseline and other existing models that enhance the locality of SANs. Compared with prior studies, the proposed model is parameter-free, introducing no additional parameters.
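
The locality part of the model can be sketched as a parameter-free band mask on the attention scores (the head-interaction part is omitted); the single-head view and window size below are simplifications.

```python
import torch

def windowed_self_attention(q, k, v, window=2):
    """Single-head self-attention restricted to a +/- `window` neighborhood.

    q, k, v: (batch, seq_len, d)  ->  (batch, seq_len, d)
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5   # (batch, L, L)
    positions = torch.arange(q.size(1))
    band = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs() <= window
    scores = scores.masked_fill(~band, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

out = windowed_self_attention(torch.randn(2, 10, 64),
                              torch.randn(2, 10, 64),
                              torch.randn(2, 10, 64))
```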

2018

Multi-Head Attention with Disagreement Regularization
Jian Li | Zhaopeng Tu | Baosong Yang | Michael R. Lyu | Tong Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Multi-head attention is appealing for its ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to differ from those of the other heads. Experimental results on the widely used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.
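
As an illustration of the third variant (disagreement on output representations), the sketch below computes the average pairwise cosine similarity among head outputs; adding this term to the training loss penalizes similar heads. The other two variants operate analogously on subspace projections and attention matrices, and the exact weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def head_output_similarity(head_outputs):
    """head_outputs: (n_heads, batch, d) -> scalar mean pairwise cosine similarity."""
    n_heads = head_outputs.size(0)
    sims = []
    for i in range(n_heads):
        for j in range(i + 1, n_heads):
            sims.append(F.cosine_similarity(head_outputs[i],
                                            head_outputs[j], dim=-1).mean())
    return torch.stack(sims).mean()

# Illustrative objective: loss = nmt_loss + lambda_ * head_output_similarity(heads)
similarity = head_output_similarity(torch.randn(8, 2, 64))
```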

Modeling Localness for Self-Attention Networks
Baosong Yang | Zhaopeng Tu | Derek F. Wong | Fandong Meng | Lidia S. Chao | Tong Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Self-attention networks have proven to be of profound value for their strength in capturing global dependencies. In this work, we propose to model localness for self-attention networks, enhancing their ability to capture useful local context. We cast localness modeling as a learnable Gaussian bias, which indicates the center and scope of the local region that should receive more attention. The bias is then incorporated into the original attention distribution to form a revised distribution. To maintain the strength in capturing long-distance dependencies while enhancing the ability to capture short-range ones, we apply localness modeling only to the lower layers of self-attention networks. Quantitative and qualitative analyses on Chinese-English and English-German translation tasks demonstrate the effectiveness and universality of the proposed approach.
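
The Gaussian bias can be written directly on the attention logits: for each query position, a predicted center and width define a penalty that decays quadratically with distance and is added before the softmax. In the sketch below, how the centers and widths are predicted is left out, which is a simplification of the paper's parameterization.

```python
import torch

def localness_bias(centers, widths, seq_len):
    """Gaussian bias G[i, j] = -(j - center_i)^2 / (2 * sigma_i^2), to be added
    to the attention logits before the softmax.

    centers, widths: (batch, seq_len) predicted per query position.
    Returns: (batch, seq_len, seq_len)
    """
    j = torch.arange(seq_len, dtype=torch.float32)             # key positions
    dist = j.view(1, 1, -1) - centers.unsqueeze(-1)            # (batch, L, L)
    return -dist.pow(2) / (2.0 * widths.unsqueeze(-1).pow(2))

bias = localness_bias(centers=torch.full((2, 10), 4.0),
                      widths=torch.full((2, 10), 2.0),
                      seq_len=10)
# attention = torch.softmax(scores + bias, dim=-1)
```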

2017

Towards Bidirectional Hierarchical Representations for Attention-based Neural Machine Translation
Baosong Yang | Derek F. Wong | Tong Xiao | Lidia S. Chao | Jingbo Zhu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper proposes a hierarchical attentional neural translation model that focuses on enhancing source-side hierarchical representations by covering both local and global semantic information with a bidirectional tree-based encoder. To maximize the predictive likelihood of target words, a weighted variant of the attention mechanism is used to balance the attentive information between lexical and phrase vectors. Using a tree-based rare-word encoding, the proposed model is extended to the sub-word level to alleviate the out-of-vocabulary (OOV) problem. Empirical results reveal that the proposed model significantly outperforms sequence-to-sequence attention-based and tree-based neural translation models on English-Chinese translation tasks.
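
The balance between lexical and phrase vectors can be sketched as a per-step weighted combination of the two attention contexts, with the weight predicted from the decoder state; the parameterization below is a hypothetical simplification of the weighted attention variant described above.

```python
import torch
import torch.nn as nn

class LexicalPhraseBalance(nn.Module):
    """Weight the attentive information from lexical vs. phrase vectors
    at each decoding step."""
    def __init__(self, d_model):
        super().__init__()
        self.weight_proj = nn.Linear(d_model, 1)

    def forward(self, decoder_state, lexical_context, phrase_context):
        beta = torch.sigmoid(self.weight_proj(decoder_state))   # (batch, 1)
        return beta * lexical_context + (1.0 - beta) * phrase_context

balance = LexicalPhraseBalance(d_model=512)
context = balance(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```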

2015

Sampling-based Alignment and Hierarchical Sub-sentential Alignment in Chinese–Japanese Translation of Patents
Wei Yang | Zhongwen Zhao | Baosong Yang | Yves Lepage
Proceedings of the 2nd Workshop on Asian Translation (WAT2015)