Chunyuan Li


2020

pdf bib
Few-shot Natural Language Generation for Task-Oriented Dialog
Baolin Peng | Chenguang Zhu | Chunyuan Li | Xiujun Li | Jinchao Li | Michael Zeng | Jianfeng Gao
Findings of the Association for Computational Linguistics: EMNLP 2020

As a crucial component in task-oriented dialog systems, the Natural Language Generation (NLG) module converts a dialog act represented in a semantic form into a response in natural language. The success of traditional template-based or statistical models typically relies on heavily annotated data, which is infeasible for new domains. Therefore, it is pivotal for an NLG system to generalize well with limited labelled data in real applications. To this end, we present FewshotWOZ, the first NLG benchmark to simulate the few-shot learning setting in task-oriented dialog systems. Further, we develop the SC-GPT model. It is pre-trained on a large set of annotated NLG corpus to acquire the controllable generation ability, and fine-tuned with only a few domain-specific labels to adapt to new domains. Experiments on FewshotWOZ and the large Multi-Domain-WOZ datasets show that the proposed SC-GPT significantly outperforms existing methods, measured by various automatic metrics and human evaluations.

pdf bib
Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
Bang An | Jie Lyu | Zhenyi Wang | Chunyuan Li | Changwei Hu | Fei Tan | Ruiyi Zhang | Yifan Hu | Changyou Chen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The neural attention mechanism plays an important role in many natural language processing applications. In particular, multi-head attention extends single-head attention by allowing a model to jointly attend information from different perspectives. However, without explicit constraining, multi-head attention may suffer from attention collapse, an issue that makes different heads extract similar attentive features, thus limiting the model’s representation power. In this paper, for the first time, we provide a novel understanding of multi-head attention from a Bayesian perspective. Based on the recently developed particle-optimization sampling techniques, we propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention and consequently strengthens model’s expressiveness. Remarkably, our Bayesian interpretation provides theoretical inspirations on the not-well-understood questions: why and how one uses multi-head attention. Extensive experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity, leading to more informative representations with consistent performance improvement on multiple tasks.

pdf bib
Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space
Chunyuan Li | Xiang Gao | Yuan Li | Baolin Peng | Xiujun Li | Yizhe Zhang | Jianfeng Gao
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

When trained effectively, the Variational Autoencoder (VAE) can be both a powerful generative model and an effective representation learning framework for natural language. In this paper, we propose the first large-scale language VAE model Optimus (Organizing sentences via Pre-Trained Modeling of a Universal Space). A universal latent embedding space for sentences is first pre-trained on large text corpus, and then fine-tuned for various language generation and understanding tasks. Compared with GPT-2, Optimus enables guided language generation from an abstract level using the latent vectors. Compared with BERT, Optimus can generalize better on low-resource language understanding tasks due to the smooth latent space structure. Extensive experimental results on a wide range of language tasks demonstrate the effectiveness of Optimus. It achieves new state-of-the-art on VAE language modeling benchmarks.

pdf bib
POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training
Yizhe Zhang | Guoyin Wang | Chunyuan Li | Zhe Gan | Chris Brockett | Bill Dolan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation. However, these models cannot be directly employed to generate text under specified lexical constraints. To address this challenge, we present POINTER (PrOgressive INsertion-based TransformER), a simple yet novel insertion-based approach for hard-constrained text generation. The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner. This procedure is recursively applied until a sequence is completed. The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable. We pre-train our model with the proposed progressive insertion-based objective on a 12GB Wikipedia dataset, and fine-tune it on downstream hard-constrained generation tasks. Non-autoregressive decoding yields a logarithmic time complexity during inference time. Experimental results on both News and Yelp datasets demonstrate that Pointer achieves state-of-the-art performance on constrained text generation. We released the pre-trained models and the source code to facilitate future research.

pdf bib
Improving Text Generation with Student-Forcing Optimal Transport
Jianqiao Li | Chunyuan Li | Guoyin Wang | Hao Fu | Yuhchen Lin | Liqun Chen | Yizhe Zhang | Chenyang Tao | Ruiyi Zhang | Wenlin Wang | Dinghan Shen | Qian Yang | Lawrence Carin
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. We examine the necessity of adding Student-Forcing scheme during training with an imitation learning interpretation. An extension is further proposed to improve the OT learning for long sequences, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.

2019

pdf bib
Robust Navigation with Language Pretraining and Stochastic Sampling
Xiujun Li | Chunyuan Li | Qiaolin Xia | Yonatan Bisk | Asli Celikyilmaz | Jianfeng Gao | Noah A. Smith | Yejin Choi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes, which can generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods to address these challenges and lead to a new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions. Second, we propose a stochastic sampling scheme to reduce the considerable gap between the expert actions in training and sampled actions in test, so that the agent can learn to correct its own mistakes during long sequential action decoding. Combining the two techniques, we achieve a new state of the art on the Room-to-Room benchmark with 6% absolute gain over the previous best result (47% -> 53%) on the Success Rate weighted by Path Length metric.

pdf bib
Implicit Deep Latent Variable Models for Text Generation
Le Fang | Chunyuan Li | Jianfeng Gao | Wen Dong | Changyou Chen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Deep latent variable models (LVM) such as variational auto-encoder (VAE) have recently played an important role in text generation. One key factor is the exploitation of smooth latent structures to guide the generation. However, the representation power of VAEs is limited due to two reasons: (1) the Gaussian assumption is often made on the variational posteriors; and meanwhile (2) a notorious “posterior collapse” issue occurs. In this paper, we advocate sample-based representations of variational distributions for natural language, leading to implicit latent features, which can provide flexible representation power compared with Gaussian-based posteriors. We further develop an LVM to directly match the aggregated posterior to the prior. It can be viewed as a natural extension of VAEs with a regularization of maximizing mutual information, mitigating the “posterior collapse” issue. We demonstrate the effectiveness and versatility of our models in various text generation scenarios, including language modeling, unaligned style transfer, and dialog response generation. The source code to reproduce our experimental results is available on GitHub.

pdf bib
DoubleTransfer at MEDIQA 2019: Multi-Source Transfer Learning for Natural Language Understanding in the Medical Domain
Yichong Xu | Xiaodong Liu | Chunyuan Li | Hoifung Poon | Jianfeng Gao
Proceedings of the 18th BioNLP Workshop and Shared Task

This paper describes our competing system to enter the MEDIQA-2019 competition. We use a multi-source transfer learning approach to transfer the knowledge from MT-DNN and SciBERT to natural language understanding tasks in the medical domain. For transfer learning fine-tuning, we use multi-task learning on NLI, RQE and QA tasks on general and medical domains to improve performance. The proposed methods are proved effective for natural language understanding in the medical domain, and we rank the first place on the QA task.

pdf bib
Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing
Hao Fu | Chunyuan Li | Xiaodong Liu | Jianfeng Gao | Asli Celikyilmaz | Lawrence Carin
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Variational autoencoders (VAE) with an auto-regressive decoder have been applied for many natural language processing (NLP) tasks. VAE objective consists of two terms, the KL regularization term and the reconstruction term, balanced by a weighting hyper-parameter 𝛽. One notorious training difficulty is that the KL term tends to vanish. In this paper we study different scheduling schemes for 𝛽, and show that KL vanishing is caused by the lack of good latent codes in training decoder at the beginning of optimization. To remedy the issue, we propose a cyclical annealing schedule, which simply repeats the process of increasing 𝛽 multiple times. This new procedure allows us to learn more meaningful latent codes progressively by leveraging the results of previous learning cycles as warm re-restart. The effectiveness of cyclical annealing schedule is validated on a broad range of NLP tasks, including language modeling, dialog response generation and semi-supervised text classification.

2018

pdf bib
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
Dinghan Shen | Guoyin Wang | Wenlin Wang | Martin Renqiang Min | Qinliang Su | Yizhe Zhang | Chunyuan Li | Ricardo Henao | Lawrence Carin
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Many deep learning architectures have been proposed to model the compositionality in text sequences, requiring substantial number of parameters and expensive computations. However, there has not been a rigorous evaluation regarding the added value of sophisticated compositional functions. In this paper, we conduct a point-by-point comparative study between Simple Word-Embedding-based Models (SWEMs), consisting of parameter-free pooling operations, relative to word-embedding-based RNN/CNN models. Surprisingly, SWEMs exhibit comparable or even superior performance in the majority of cases considered. Based upon this understanding, we propose two additional pooling strategies over learned word embeddings: (i) a max-pooling operation for improved interpretability; and (ii) a hierarchical pooling operation, which preserves spatial (n-gram) information within text sequences. We present experiments on 17 datasets encompassing three tasks: (i) (long) document classification; (ii) text sequence matching; and (iii) short text tasks, including classification and tagging.

pdf bib
Joint Embedding of Words and Labels for Text Classification
Guoyin Wang | Chunyuan Li | Wenlin Wang | Yizhe Zhang | Dinghan Shen | Xinyuan Zhang | Ricardo Henao | Lawrence Carin
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word embeddings are effective intermediate representations for capturing semantic regularities between words, when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space with the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information, in addition to input text sequences. Extensive results on the several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.

2017

pdf bib
Scalable Bayesian Learning of Recurrent Neural Networks for Language Modeling
Zhe Gan | Chunyuan Li | Changyou Chen | Yunchen Pu | Qinliang Su | Lawrence Carin
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recurrent neural networks (RNNs) have shown promising performance for language modeling. However, traditional training of RNNs using back-propagation through time often suffers from overfitting. One reason for this is that stochastic optimization (used for large training sets) does not provide good estimates of model uncertainty. This paper leverages recent advances in stochastic gradient Markov Chain Monte Carlo (also appropriate for large training sets) to learn weight uncertainty in RNNs. It yields a principled Bayesian learning algorithm, adding gradient noise during training (enhancing exploration of the model-parameter space) and model averaging when testing. Extensive experiments on various RNN models and across a broad range of applications demonstrate the superiority of the proposed approach relative to stochastic optimization.

pdf bib
Learning Generic Sentence Representations Using Convolutional Neural Networks
Zhe Gan | Yunchen Pu | Ricardo Henao | Chunyuan Li | Xiaodong He | Lawrence Carin
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We propose a new encoder-decoder approach to learn distributed sentence representations that are applicable to multiple purposes. The model is learned by using a convolutional neural network as an encoder to map an input sentence into a continuous vector, and using a long short-term memory recurrent neural network as a decoder. Several tasks are considered, including sentence reconstruction and future sentence prediction. Further, a hierarchical encoder-decoder model is proposed to encode a sentence to predict multiple future sentences. By training our models on a large collection of novels, we obtain a highly generic convolutional sentence encoder that performs well in practice. Experimental results on several benchmark datasets, and across a broad range of applications, demonstrate the superiority of the proposed model over competing methods.