Biao Zhang


2020

pdf bib
Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
Biao Zhang | Philip Williams | Ivan Titov | Rico Sennrich
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations. In this paper, we explore ways to improve them. We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics, and overcome this bottleneck via language-specific components and deepening NMT architectures. We identify the off-target translation issue (i.e. translating into a wrong target language) as the major source of the inferior zero-shot performance, and propose random online backtranslation to enforce the translation of unseen training language pairs. Experiments on OPUS-100 (a novel multilingual dataset with 100 languages) show that our approach substantially narrows the performance gap with bilingual models in both one-to-many and many-to-many settings, and improves zero-shot performance by ~10 BLEU, approaching conventional pivot-based methods.

pdf bib
A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing
Rachel Bawden | Biao Zhang | Lisa Yankovskaya | Andre Tättar | Matt Post
Findings of the Association for Computational Linguistics: EMNLP 2020

We investigate a long-perceived shortcoming in the typical use of BLEU: its reliance on a single reference. Using modern neural paraphrasing techniques, we study whether automatically generating additional *diverse* references can provide better coverage of the space of valid translations and thereby improve its correlation with human judgments. Our experiments on the into-English language directions of the WMT19 metrics task (at both the system and sentence level) show that using paraphrased references does generally improve BLEU, and when it does, the more diverse the better. However, we also show that better results could be achieved if those paraphrases were to specifically target the parts of the space most relevant to the MT outputs being evaluated. Moreover, the gains remain slight even when human paraphrases are used, suggesting inherent limitations to BLEU’s capacity to correctly exploit multiple references. Surprisingly, we also find that adequacy appears to be less important, as shown by the high results of a strong sampling approach, which even beats human paraphrases when used with sentence-level BLEU.

pdf bib
Adaptive Feature Selection for End-to-End Speech Translation
Biao Zhang | Ivan Titov | Barry Haddow | Rico Sennrich
Findings of the Association for Computational Linguistics: EMNLP 2020

Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. A ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech EnFr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).

2019

pdf bib
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
Biao Zhang | Ivan Titov | Rico Sennrich
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connection and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt. Source code for reproduction will be released soon.

pdf bib
Revisiting Low-Resource Neural Machine Translation: A Case Study
Rico Sennrich | Biao Zhang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions, underperforming phrase-based statistical machine translation (PBSMT) and requiring large amounts of auxiliary data to achieve competitive results. In this paper, we re-assess the validity of these results, arguing that they are the result of lack of system adaptation to low-resource settings. We discuss some pitfalls to be aware of when training low-resource NMT systems, and recent techniques that have shown to be especially helpful in low-resource settings, resulting in a set of best practices for low-resource NMT. In our experiments on German–English with different amounts of IWSLT14 training data, we show that, without the use of any auxiliary monolingual or multilingual data, an optimized NMT system can outperform PBSMT with far less data than previously claimed. We also apply these techniques to a low-resource Korean–English dataset, surpassing previously reported results by 4 BLEU.

pdf bib
A Lightweight Recurrent Network for Sequence Modeling
Biao Zhang | Rico Sennrich
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recurrent networks have achieved great success on various sequential tasks with the assistance of complex recurrent units, but suffer from severe computational inefficiency due to weak parallelization. One direction to alleviate this issue is to shift heavy computations outside the recurrence. In this paper, we propose a lightweight recurrent network, or LRN. LRN uses input and forget gates to handle long-range dependencies as well as gradient vanishing and explosion, with all parameter related calculations factored outside the recurrence. The recurrence in LRN only manipulates the weight assigned to each token, tightly connecting LRN with self-attention networks. We apply LRN as a drop-in replacement of existing recurrent units in several neural sequential models. Extensive experiments on six NLP tasks show that LRN yields the best running efficiency with little or no loss in model performance.

2018

pdf bib
Simplifying Neural Machine Translation with Addition-Subtraction Twin-Gated Recurrent Networks
Biao Zhang | Deyi Xiong | Jinsong Su | Qian Lin | Huiji Zhang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose an additionsubtraction twin-gated recurrent network (ATR) to simplify neural machine translation. The recurrent units of ATR are heavily simplified to have the smallest number of weight matrices among units of all existing gated RNNs. With the simple addition and subtraction operation, we introduce a twin-gated mechanism to build input and forget gates which are highly correlated. Despite this simplification, the essential non-linearities and capability of modeling long-distance dependencies are preserved. Additionally, the proposed ATR is more transparent than LSTM/GRU due to the simplification. Forward self-attention can be easily established in ATR, which makes the proposed network interpretable. Experiments on WMT14 translation tasks demonstrate that ATR-based neural machine translation can yield competitive performance on English-German and English-French language pairs in terms of both translation quality and speed. Further experiments on NIST Chinese-English translation, natural language inference and Chinese word segmentation verify the generality and applicability of ATR on different natural language processing tasks.

pdf bib
Accelerating Neural Transformer via an Average Attention Network
Biao Zhang | Deyi Xiong | Jinsong Su
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With parallelizable attention networks, the neural Transformer is very fast to train. However, due to the auto-regressive architecture and self-attention in the decoder, the decoding procedure becomes slow. To alleviate this issue, we propose an average attention network as an alternative to the self-attention network in the decoder of the neural Transformer. The average attention network consists of two layers, with an average layer that models dependencies on previous positions and a gating layer that is stacked over the average layer to enhance the expressiveness of the proposed attention network. We apply this network on the decoder part of the neural Transformer to replace the original target-side self-attention model. With masking tricks and dynamic programming, our model enables the neural Transformer to decode sentences over four times faster than its original version with almost no loss in training time and translation performance. We conduct a series of experiments on WMT17 translation tasks, where on 6 different language pairs, we obtain robust and consistent speed-ups in decoding.

2016

pdf bib
Variational Neural Discourse Relation Recognizer
Biao Zhang | Deyi Xiong | Jinsong Su | Qun Liu | Rongrong Ji | Hong Duan | Min Zhang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Variational Neural Machine Translation
Biao Zhang | Deyi Xiong | Jinsong Su | Hong Duan | Min Zhang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Bilingual Autoencoders with Global Descriptors for Modeling Parallel Sentences
Biao Zhang | Deyi Xiong | Jinsong Su | Hong Duan | Min Zhang
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Parallel sentence representations are important for bilingual and cross-lingual tasks in natural language processing. In this paper, we explore a bilingual autoencoder approach to model parallel sentences. We extract sentence-level global descriptors (e.g. min, max) from word embeddings, and construct two monolingual autoencoders over these descriptors on the source and target language. In order to tightly connect the two autoencoders with bilingual correspondences, we force them to share the same decoding parameters and minimize a corpus-level semantic distance between the two languages. Being optimized towards a joint objective function of reconstruction and semantic errors, our bilingual antoencoder is able to learn continuous-valued latent representations for parallel sentences. Experiments on both intrinsic and extrinsic evaluations on statistical machine translation tasks show that our autoencoder achieves substantial improvements over the baselines.

pdf bib
Convolution-Enhanced Bilingual Recursive Neural Network for Bilingual Semantic Modeling
Jinsong Su | Biao Zhang | Deyi Xiong | Ruochen Li | Jianmin Yin
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Estimating similarities at different levels of linguistic units, such as words, sub-phrases and phrases, is helpful for measuring semantic similarity of an entire bilingual phrase. In this paper, we propose a convolution-enhanced bilingual recursive neural network (ConvBRNN), which not only exploits word alignments to guide the generation of phrase structures but also integrates multiple-level information of the generated phrase structures into bilingual semantic modeling. In order to accurately learn the semantic hierarchy of a bilingual phrase, we develop a recursive neural network to constrain the learned bilingual phrase structures to be consistent with word alignments. Upon the generated source and target phrase structures, we stack a convolutional neural network to integrate vector representations of linguistic units on the structures into bilingual phrase embeddings. After that, we fully incorporate information of different linguistic units into a bilinear semantic similarity model. We introduce two max-margin losses to train the ConvBRNN model: one for the phrase structure inference and the other for the semantic similarity model. Experiments on NIST Chinese-English translation tasks demonstrate the high quality of the generated bilingual phrase structures with respect to word alignments and the effectiveness of learned semantic similarities on machine translation.

2015

pdf bib
Bilingual Correspondence Recursive Autoencoder for Statistical Machine Translation
Jinsong Su | Deyi Xiong | Biao Zhang | Yang Liu | Junfeng Yao | Min Zhang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Shallow Convolutional Neural Network for Implicit Discourse Relation Recognition
Biao Zhang | Jinsong Su | Deyi Xiong | Yaojie Lu | Hong Duan | Junfeng Yao
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing