Chien-Sheng Wu


pdf bib
Explicit Memory Tracker with Coarse-to-Fine Reasoning for Conversational Machine Reading
Yifan Gao | Chien-Sheng Wu | Shafiq Joty | Caiming Xiong | Richard Socher | Irwin King | Michael Lyu | Steven C.H. Hoi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The goal of conversational machine reading is to answer user questions given a knowledge base text which may require asking clarification questions. Existing approaches are limited in their decision making due to struggles in extracting question-related rules and reasoning about them. In this paper, we present a new framework of conversational machine reading that comprises a novel Explicit Memory Tracker (EMT) to track whether conditions listed in the rule text have already been satisfied to make a decision. Moreover, our framework generates clarification questions by adopting a coarse-to-fine reasoning strategy, utilizing sentence-level entailment scores to weight token-level distributions. On the ShARC benchmark (blind, held-out) testset, EMT achieves new state-of-the-art results of 74.6% micro-averaged decision accuracy and 49.5 BLEU4. We also show that EMT is more interpretable by visualizing the entailment-oriented reasoning process as the conversation flows. Code and models are released at

pdf bib
Getting To Know You: User Attribute Extraction from Dialogues
Chien-Sheng Wu | Andrea Madotto | Zhaojiang Lin | Peng Xu | Pascale Fung
Proceedings of the 12th Language Resources and Evaluation Conference

User attributes provide rich and useful information for user understanding, yet structured and easy-to-use attributes are often sparsely populated. In this paper, we leverage dialogues with conversational agents, which contain strong suggestions of user information, to automatically extract user attributes. Since no existing dataset is available for this purpose, we apply distant supervision to train our proposed two-stage attribute extractor, which surpasses several retrieval and generation baselines on human evaluation. Meanwhile, we discuss potential applications (e.g., personalized recommendation and dialogue systems) of such extracted user attributes, and point out current limitations to cast light on future work.

pdf bib
Improving Limited Labeled Dialogue State Tracking with Self-Supervision
Chien-Sheng Wu | Steven C.H. Hoi | Caiming Xiong
Findings of the Association for Computational Linguistics: EMNLP 2020

Existing dialogue state tracking (DST) models require plenty of labeled data. However, collecting high-quality labels is costly, especially when the number of domains increases. In this paper, we address a practical DST problem that is rarely discussed, i.e., learning efficiently with limited labeled data. We present and investigate two self-supervised objectives: preserving latent consistency and modeling conversational behavior. We encourage a DST model to have consistent latent distributions given a perturbed input, making it more robust to an unseen scenario. We also add an auxiliary utterance generation task, modeling a potential correlation between conversational behavior and dialogue states. The experimental results show that our proposed self-supervised signals can improve joint goal accuracy by 8.95% when only 1% labeled data is used on the MultiWOZ dataset. We can achieve an additional 1.76% improvement if some unlabeled data is jointly trained as semi-supervised learning. We analyze and visualize how our proposed self-supervised signals help the DST task and hope to stimulate future data-efficient DST research.

pdf bib
TOD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogue
Chien-Sheng Wu | Steven C.H. Hoi | Richard Socher | Caiming Xiong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The underlying difference of linguistic patterns between general text and task-oriented dialogue makes existing pre-trained language models less useful in practice. In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling. To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling. We propose a contrastive objective function to simulate the response selection task. Our pre-trained task-oriented dialogue BERT (TOD-BERT) outperforms strong baselines like BERT on four downstream task-oriented dialogue applications, including intention recognition, dialogue state tracking, dialogue act prediction, and response selection. We also show that TOD-BERT has a stronger few-shot ability that can mitigate the data scarcity problem for task-oriented dialogue.

pdf bib
Discern: Discourse-Aware Entailment Reasoning Network for Conversational Machine Reading
Yifan Gao | Chien-Sheng Wu | Jingjing Li | Shafiq Joty | Steven C.H. Hoi | Caiming Xiong | Irwin King | Michael Lyu
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Document interpretation and dialog understanding are the two major challenges for conversational machine reading. In this work, we propose “Discern”, a discourse-aware entailment reasoning network to strengthen the connection and enhance the understanding of both document and dialog. Specifically, we split the document into clause-like elementary discourse units (EDU) using a pre-trained discourse segmentation model, and we train our model in a weakly-supervised manner to predict whether each EDU is entailed by the user feedback in a conversation. Based on the learned EDU and entailment representations, we either reply to the user our final decision “yes/no/irrelevant” of the initial question, or generate a follow-up question to inquiry more information. Our experiments on the ShARC benchmark (blind, held-out test set) show that Discern achieves state-of-the-art results of 78.3% macro-averaged accuracy on decision making and 64.0 BLEU1 on follow-up question generation. Code and models are released at

pdf bib
Probing Task-Oriented Dialogue Representation from Language Models
Chien-Sheng Wu | Caiming Xiong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper investigates pre-trained language models to find out which model intrinsically carries the most informative representation for task-oriented dialogue tasks. We approach the problem from two aspects: supervised classifier probe and unsupervised mutual information probe. We fine-tune a feed-forward layer as the classifier probe on top of a fixed pre-trained language model with annotated labels in a supervised way. Meanwhile, we propose an unsupervised mutual information probe to evaluate the mutual dependence between a real clustering and a representation clustering. The goals of this empirical paper are to 1) investigate probing techniques, especially from the unsupervised mutual information aspect, 2) provide guidelines of pre-trained language model selection for the dialogue research community, 3) find insights of pre-training factors for dialogue application that may be the key to success.

pdf bib
Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference
Jianguo Zhang | Kazuma Hashimoto | Wenhao Liu | Chien-Sheng Wu | Yao Wan | Philip Yu | Richard Socher | Caiming Xiong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Intent detection is one of the core components of goal-oriented dialog systems, and detecting out-of-scope (OOS) intents is also a practically important skill. Few-shot learning is attracting much attention to mitigate data scarcity, but OOS detection becomes even more challenging. In this paper, we present a simple yet effective approach, discriminative nearest neighbor classification with deep self-attention. Unlike softmax classifiers, we leverage BERT-style pairwise encoding to train a binary classifier that estimates the best matched training example for a user input. We propose to boost the discriminative ability by transferring a natural language inference (NLI) model. Our extensive experiments on a large-scale multi-domain intent detection task show that our method achieves more stable and accurate in-domain and OOS detection accuracy than RoBERTa-based classifiers and embedding-based nearest neighbor approaches. More notably, the NLI transfer enables our 10-shot model to perform competitively with 50-shot or even full-shot classifiers, while we can keep the inference time constant by leveraging a faster embedding retrieval model.

pdf bib
Find or Classify? Dual Strategy for Slot-Value Predictions on Multi-Domain Dialog State Tracking
Jianguo Zhang | Kazuma Hashimoto | Chien-Sheng Wu | Yao Wang | Philip Yu | Richard Socher | Caiming Xiong
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Dialog state tracking (DST) is a core component in task-oriented dialog systems. Existing approaches for DST mainly fall into one of two categories, namely, ontology-based and ontology-free methods. An ontology-based method selects a value from a candidate-value list for each target slot, while an ontology-free method extracts spans from dialog contexts. Recent work introduced a BERT-based model to strike a balance between the two methods by pre-defining categorical and non-categorical slots. However, it is not clear enough which slots are better handled by either of the two slot types, and the way to use the pre-trained model has not been well investigated. In this paper, we propose a simple yet effective dual-strategy model for DST, by adapting a single BERT-style reading comprehension model to jointly handle both the categorical and non-categorical slots. Our experiments on the MultiWOZ datasets show that our method significantly outperforms the BERT-based counterpart, finding that the key is a deep interaction between the domain-slot and context information. When evaluated on noisy (MultiWOZ 2.0) and cleaner (MultiWOZ 2.1) settings, our method performs competitively and robustly across the two different settings. Our method sets the new state of the art in the noisy setting, while performing more robustly than the best model in the cleaner setting. We also conduct a comprehensive error analysis on the dataset, including the effects of the dual strategy for each slot, to facilitate future research.


pdf bib
Clickbait? Sensational Headline Generation with Auto-tuned Reinforcement Learning
Peng Xu | Chien-Sheng Wu | Andrea Madotto | Pascale Fung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Sensational headlines are headlines that capture people’s attention and generate reader interest. Conventional abstractive headline generation methods, unlike human writers, do not optimize for maximal reader attention. In this paper, we propose a model that generates sensational headlines without labeled data. We first train a sensationalism scorer by classifying online headlines with many comments (“clickbait”) against a baseline of headlines generated from a summarization model. The score from the sensationalism scorer is used as the reward for a reinforcement learner. However, maximizing the noisy sensationalism reward will generate unnatural phrases instead of sensational headlines. To effectively leverage this noisy reward, we propose a novel loss function, Auto-tuned Reinforcement Learning (ARL), to dynamically balance reinforcement learning (RL) with maximum likelihood estimation (MLE). Human evaluation shows that 60.8% of samples generated by our model are sensational, which is significantly better than the Pointer-Gen baseline and other RL models.

pdf bib
Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences
Genta Indra Winata | Andrea Madotto | Chien-Sheng Wu | Pascale Fung
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Training code-switched language models is difficult due to lack of data and complexity in the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, this require external word alignments or constituency parsers that create erroneous results on distant languages. We propose a sequence-to-sequence model using a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and identifies when to switch one language to the other. Moreover, it captures code-switching constraints by attending and aligning the words in inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated sentences achieves state-of-the-art performance and improves end-to-end automatic speech recognition.

pdf bib
Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
Chien-Sheng Wu | Andrea Madotto | Ehsan Hosseini-Asl | Caiming Xiong | Richard Socher | Pascale Fung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Over-dependence on domain ontology and lack of sharing knowledge across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short when tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using copy mechanism, facilitating transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art 48.62% joint goal accuracy for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show the transferring ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains.

pdf bib
Personalizing Dialogue Agents via Meta-Learning
Andrea Madotto | Zhaojiang Lin | Chien-Sheng Wu | Pascale Fung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Existing personalized dialogue models use human designed persona descriptions to improve dialogue consistency. Collecting such descriptions from existing dialogues is expensive and requires hand-crafted feature designs. In this paper, we propose to extend Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) to personalized dialogue learning without using any persona descriptions. Our model learns to quickly adapt to new personas by leveraging only a few dialogue samples collected from the same user, which is fundamentally different from conditioning the response on the persona descriptions. Empirical results on Persona-chat dataset (Zhang et al., 2018) indicate that our solution outperforms non-meta-learning baselines using automatic evaluation metrics, and in terms of human-evaluated fluency and consistency.


pdf bib
Improving Large-Scale Fact-Checking using Decomposable Attention Models and Lexical Tagging
Nayeon Lee | Chien-Sheng Wu | Pascale Fung
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Fact-checking of textual sources needs to effectively extract relevant information from large knowledge bases. In this paper, we extend an existing pipeline approach to better tackle this problem. We propose a neural ranker using a decomposable attention model that dynamically selects sentences to achieve promising improvement in evidence retrieval F1 by 38.80%, with (x65) speedup compared to a TF-IDF method. Moreover, we incorporate lexical tagging methods into our pipeline framework to simplify the tasks and render the model more generalizable. As a result, our framework achieves promising performance on a large-scale fact extraction and verification dataset with speedup.

pdf bib
Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems
Andrea Madotto | Chien-Sheng Wu | Pascale Fung
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

End-to-end task-oriented dialog systems usually suffer from the challenge of incorporating knowledge bases. In this paper, we propose a novel yet simple end-to-end differentiable model called memory-to-sequence (Mem2Seq) to address this issue. Mem2Seq is the first neural generative model that combines the multi-hop attention over memories with the idea of pointer network. We empirically show how Mem2Seq controls each generation step, and how its multi-hop attention mechanism helps in learning correlations between memories. In addition, our model is quite general without complicated task-specific designs. As a result, we show that Mem2Seq can be trained faster and attain the state-of-the-art performance on three different task-oriented dialog datasets.

pdf bib
Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning
Genta Indra Winata | Andrea Madotto | Chien-Sheng Wu | Pascale Fung
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Lack of text data has been the major issue on code-switching language modeling. In this paper, we introduce multi-task learning based language model which shares syntax representation of languages to leverage linguistic information and tackle the low resource data issue. Our model jointly learns both language modeling and Part-of-Speech tagging on code-switched utterances. In this way, the model is able to identify the location of code-switching points and improves the prediction of next word. Our approach outperforms standard LSTM based language model, with an improvement of 9.7% and 7.4% in perplexity on SEAME Phase I and Phase II dataset respectively.

pdf bib
Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition
Genta Indra Winata | Chien-Sheng Wu | Andrea Madotto | Pascale Fung
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

We propose an LSTM-based model with hierarchical architecture on named entity recognition from code-switching Twitter data. Our model uses bilingual character representation and transfer learning to address out-of-vocabulary words. In order to mitigate data noise, we propose to use token replacement and normalization. In the 3rd Workshop on Computational Approaches to Linguistic Code-Switching Shared Task, we achieved second place with 62.76% harmonic mean F1-score for English-Spanish language pair without using any gazetteer and knowledge-based information.

pdf bib
Emo2Vec: Learning Generalized Emotion Representation by Multi-task Training
Peng Xu | Andrea Madotto | Chien-Sheng Wu | Ji Ho Park | Pascale Fung
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

In this paper, we propose Emo2Vec which encodes emotional semantics into vectors. We train Emo2Vec by multi-task learning six different emotion-related tasks, including emotion/sentiment analysis, sarcasm classification, stress detection, abusive language classification, insult detection, and personality recognition. Our evaluation of Emo2Vec shows that it outperforms existing affect-related representations, such as Sentiment-Specific Word Embedding and DeepMoji embeddings with much smaller training corpora. When concatenated with GloVe, Emo2Vec achieves competitive performances to state-of-the-art results on several tasks using a simple logistic regression classifier.


pdf bib
Real-Time Speech Emotion and Sentiment Recognition for Interactive Dialogue Systems
Dario Bertero | Farhad Bin Siddique | Chien-Sheng Wu | Yan Wan | Ricky Ho Yin Chan | Pascale Fung
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Zara: A Virtual Interactive Dialogue System Incorporating Emotion, Sentiment and Personality Recognition
Pascale Fung | Anik Dey | Farhad Bin Siddique | Ruixi Lin | Yang Yang | Dario Bertero | Yan Wan | Ricky Ho Yin Chan | Chien-Sheng Wu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Zara, or ‘Zara the Supergirl’ is a virtual robot, that can exhibit empathy while interacting with an user, with the aid of its built in facial and emotion recognition, sentiment analysis, and speech module. At the end of the 5-10 minute conversation, Zara can give a personality analysis of the user based on all the user utterances. We have also implemented a real-time emotion recognition, using a CNN model that detects emotion from raw audio without feature extraction, and have achieved an average of 65.7% accuracy on six different emotion classes, which is an impressive 4.5% improvement from the conventional feature based SVM classification. Also, we have described a CNN based sentiment analysis module trained using out-of-domain data, that recognizes sentiment from the speech recognition transcript, which has a 74.8 F-measure when tested on human-machine dialogues.