João Sedoc


2020

Learning Word Ratings for Empathy and Distress from Document-Level User Responses
João Sedoc | Sven Buechel | Yehonathan Nachmany | Anneke Buffone | Lyle Ungar
Proceedings of the 12th Language Resources and Evaluation Conference

Despite the excellent performance of black box approaches to modeling sentiment and emotion, lexica (sets of informative words and associated weights) that characterize different emotions are indispensable to the NLP community because they allow for interpretable and robust predictions. Emotion analysis of text is increasing in popularity in NLP; however, manually creating lexica for psychological constructs such as empathy has proven difficult. This paper automatically creates empathy word ratings from document-level ratings. The underlying problem of learning word ratings from higher-level supervision has to date only been addressed in an ad hoc fashion and has not used deep learning methods. We systematically compare a number of approaches to learning word ratings from higher-level supervision against a Mixed-Level Feed Forward Network (MLFFN), which we find performs best, and use the MLFFN to create the first-ever empathy lexicon. We then use Signed Spectral Clustering to gain insights into the resulting words. The empathy and distress lexica are publicly available at: http://www.wwbp.org/lexica.html.

SMRT Chatbots: Improving Non-Task-Oriented Dialog with Simulated Multiple Reference Training
Huda Khayrallah | João Sedoc
Findings of the Association for Computational Linguistics: EMNLP 2020

Non-task-oriented dialog models suffer from poor quality and non-diverse responses. To overcome limited conversational data, we apply Simulated Multiple Reference Training (SMRT; Khayrallah et al., 2020), and use a paraphraser to simulate multiple responses per training prompt. We find SMRT improves over a strong Transformer baseline as measured by human and automatic quality scores and lexical diversity. We also find SMRT is comparable to pretraining in human evaluation quality, and outperforms pretraining on automatic quality and lexical diversity, without requiring related-domain dialog data.

Proceedings of the First Workshop on Insights from Negative Results in NLP
Anna Rogers | João Sedoc | Anna Rumshisky
Proceedings of the First Workshop on Insights from Negative Results in NLP

Collecting Verified COVID-19 Question Answer Pairs
Adam Poliak | Max Fleming | Cash Costello | Kenton W Murray | Mahsa Yarmohammadi | Shivani Pandya | Darius Irani | Milind Agarwal | Udit Sharma | Shuo Sun | Nicola Ivanov | Lingxi Shang | Kaushik Srinivasan | Seolhwa Lee | Xu Han | Smisha Agarwal | João Sedoc
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We release a dataset of over 2,100 COVID-19-related frequently asked question-answer pairs scraped from over 40 trusted websites. We include an additional 24,000 questions pulled from online sources that have been aligned by experts with existing answered questions from our dataset. This paper describes our efforts in collecting the dataset and summarizes the resulting data. Our dataset is automatically updated daily and available at https://github.com/JHU-COVID-QA/scraping-qas. So far, this data has been used to develop a chatbot providing users information about COVID-19. We encourage others to build analytics and tools upon this dataset as well.

Using the Poly-encoder for a COVID-19 Question Answering System
Seolhwa Lee | João Sedoc
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

To combat misinformation regarding COVID-19 during this unprecedented pandemic, we propose a conversational agent that answers questions related to COVID-19. We adapt the Poly-encoder (Humeau et al., 2020) model for information retrieval from FAQs. We show that after fine-tuning, the Poly-encoder can achieve a higher F1 score. We make our code publicly available for other researchers to use.

COD3S: Diverse Generation with Discrete Semantic Signatures
Nathaniel Weir | João Sedoc | Benjamin Van Durme
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present COD3S, a novel method for generating semantically diverse sentences using neural sequence-to-sequence (seq2seq) models. Conditioned on an input, seq2seq models typically produce semantically and syntactically homogeneous sets of sentences and thus perform poorly on one-to-many sequence generation tasks. Our two-stage approach improves output diversity by conditioning generation on locality-sensitive hash (LSH)-based semantic sentence codes whose Hamming distances highly correlate with human judgments of semantic textual similarity. Though it is generally applicable, we apply it to causal generation, the task of predicting a proposition’s plausible causes or effects. We demonstrate through automatic and human evaluation that responses produced using our method exhibit improved diversity without degrading task performance.
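
The idea of LSH-based sentence codes compared by Hamming distance can be illustrated with a minimal sketch. It assumes random-hyperplane (sign) hashing over fixed-size sentence embeddings; the function names, the toy 4-dimensional vectors, and the 16-bit code length are all hypothetical and are not taken from the paper.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes (an assumed hashing scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

def hamming(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy "sentence embeddings" (hypothetical values, for illustration only).
planes = make_hyperplanes(dim=4, n_bits=16)
u = [0.9, 0.1, 0.0, 0.2]
near = [0.8, 0.2, 0.1, 0.2]
opposite = [-x for x in u]

d_near = hamming(lsh_code(u, planes), lsh_code(near, planes))
d_far = hamming(lsh_code(u, planes), lsh_code(opposite, planes))
print(d_near, d_far)
```

Vectors pointing in similar directions tend to agree on most bits, while a vector and its negation disagree on every bit, which is why Hamming distance over such codes can serve as a coarse proxy for semantic distance.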

Incremental Neural Coreference Resolution in Constant Memory
Patrick Xia | João Sedoc | Benjamin Van Durme
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We investigate modeling coreference resolution under a fixed memory constraint by extending an incremental clustering algorithm to utilize contextualized encoders and neural components. Given a new sentence, our end-to-end algorithm proposes and scores each mention span against explicit entity representations created from the earlier document context (if any). These spans are then used to update the entity’s representations before being forgotten; we only retain a fixed set of salient entities throughout the document. In this work, we successfully convert a high-performing model (Joshi et al., 2020), asymptotically reducing its memory usage to constant space with only a 0.3% relative loss in F1 on OntoNotes 5.0.

Item Response Theory for Efficient Human Evaluation of Chatbots
João Sedoc | Lyle Ungar
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.
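
As a rough illustration of why some prompts discriminate better among strong systems than weak ones, here is a sketch of the standard two-parameter logistic (2PL) IRT model from educational testing. This is an assumption for illustration: the abstract does not state which IRT variant is used for paired comparisons, and the names `theta`, `a`, and `b` are conventional IRT notation rather than the paper's.

```python
import math

def p_success(theta, a, b):
    """2PL model: probability that a system of ability theta 'wins' on an
    item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_success(theta, a, b)
    return a * a * p * (1.0 - p)

# A "hard" prompt (b = 1.5) tells us more about a strong system
# (theta = 1.5) than about a weak one (theta = -1.5).
info_strong = item_information(theta=1.5, a=1.0, b=1.5)
info_weak = item_information(theta=-1.5, a=1.0, b=1.5)
print(info_strong, info_weak)
```

Under this model an item is most informative when its difficulty matches the ability being measured, so selecting items by information is one plausible way to shrink an evaluation set while keeping discriminative power.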

Learning Emotion from 100 Observations: Unexpected Robustness of Deep Learning under Strong Data Limitations
Sven Buechel | João Sedoc | H. Andrew Schwartz | Lyle Ungar
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media

One of the major downsides of deep learning is its supposed need for vast amounts of training data. As such, these techniques appear ill-suited for NLP areas where annotated data is limited, such as less-resourced languages or emotion analysis, with its many nuanced and hard-to-acquire annotation formats. We conduct a questionnaire study indicating that the vast majority of researchers in emotion analysis indeed deem neural models inferior to traditional machine learning when training data is limited. In stark contrast to those survey results, we provide empirical evidence for English, Polish, and Portuguese that commonly used neural architectures can be trained on surprisingly few observations, outperforming n-gram based ridge regression on only 100 data points. Our analysis suggests that high-quality, pre-trained word embeddings are a main factor for achieving those results.

2019

Comparison of Diverse Decoding Methods from Conditional Language Models
Daphne Ippolito | Reno Kriz | João Sedoc | Maria Kustikova | Chris Callison-Burch
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While conditional language models have greatly improved in their ability to output high quality natural language, many NLP applications benefit from being able to generate a diverse set of candidate sequences. Diverse decoding strategies aim to, within a given-sized candidate list, cover as much of the space of high-quality outputs as possible, leading to improvements for tasks that rerank and combine candidate outputs. Standard decoding methods, such as beam search, optimize for generating high likelihood sequences rather than diverse ones, though recent work has focused on increasing diversity in these methods. In this work, we perform an extensive survey of decoding-time strategies for generating diverse outputs from a conditional language model. In addition, we present a novel method where we over-sample candidates, then use clustering to remove similar sequences, thus achieving high diversity without sacrificing quality.
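
The over-sample-then-cluster idea can be sketched as follows. This is only an illustration under stated assumptions: greedy leader clustering with word-level Jaccard similarity stands in for whatever clustering and similarity measure the paper uses, and the function names, threshold, and toy candidate pool are hypothetical.

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two strings."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def dedup_by_clustering(candidates, k, threshold=0.5):
    """Keep up to k candidates, one per cluster.

    candidates: list of (text, score) pairs from an over-sampled pool.
    A candidate starts a new cluster only if it is dissimilar
    (similarity < threshold) to every cluster leader kept so far.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    leaders = []
    for text, score in ranked:
        if all(jaccard(text, kept) < threshold for kept, _ in leaders):
            leaders.append((text, score))
        if len(leaders) == k:
            break
    return leaders

# Hypothetical over-sampled pool with model log-probabilities as scores.
pool = [
    ("the cat sat on the mat", -1.0),
    ("the cat sat on a mat", -1.1),
    ("the cat sat on the rug", -1.2),
    ("a dog barked loudly", -1.5),
    ("dogs bark", -2.0),
]
selected = dedup_by_clustering(pool, k=2)
print([text for text, _ in selected])
```

Because candidates are visited in score order, each cluster is represented by its highest-scoring member, which is one simple way to trade off quality against diversity.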

Conceptor Debiasing of Word Representations Evaluated on WEAT
Saket Karve | Lyle Ungar | João Sedoc
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Bias in word representations, such as Word2Vec, has been widely reported and investigated, and efforts made to debias them. We apply the debiasing conceptor for post-processing both traditional and contextualized word embeddings. Our method can simultaneously remove racial and gender biases from word representations. Unlike standard debiasing methods, the debiasing conceptor can utilize heterogeneous lists of biased words without loss in performance. Finally, our empirical experiments show that the debiasing conceptor diminishes racial and gender bias of word representations as measured using the Word Embedding Association Test (WEAT) of Caliskan et al. (2017).
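
For orientation, the conceptor negation used for post-processing can be written as below. This follows the standard conceptor definition rather than anything stated in the abstract; the symbols are notation assumed here: $X \in \mathbb{R}^{d \times n}$ holds the embeddings of the $n$ bias-list words as columns, $\alpha > 0$ is the aperture hyperparameter, and $x$ is any word vector to be debiased.

```latex
R = \frac{1}{n} X X^{\top}, \qquad
C = R \left( R + \alpha^{-2} I \right)^{-1}, \qquad
\neg C = I - C, \qquad
x' = (\neg C)\, x
```

Intuitively, $C$ softly projects onto the directions of highest variance among the bias-list words, so applying $\neg C$ damps those directions instead of removing a single hard subspace.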

The Role of Protected Class Word Lists in Bias Identification of Contextualized Word Representations
João Sedoc | Lyle Ungar
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Systemic bias in word embeddings has been widely reported and studied, and efforts made to debias them; however, new contextualized embeddings such as ELMo and BERT are only now being similarly studied. Standard debiasing methods require heterogeneous lists of target words to identify the “bias subspace”. We show that using new contextualized word embeddings in conceptor debiasing allows us to more accurately debias word embeddings by breaking target word lists into more homogeneous subsets and then combining (“OR’ing”) the debiasing conceptors of the different subsets.
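
The "OR" of two conceptors can be expressed through the standard Boolean operations on conceptor matrices. The formulas below are the usual definitions from the conceptor literature, given here as a reference sketch; $C_1$ and $C_2$ are assumed to be the conceptors computed from two word-list subsets, and the AND formula as written assumes both matrices are invertible.

```latex
\neg C = I - C, \qquad
C_1 \wedge C_2 = \left( C_1^{-1} + C_2^{-1} - I \right)^{-1}, \qquad
C_1 \vee C_2 = \neg \left( \neg C_1 \wedge \neg C_2 \right)
```

Under these definitions, OR'ing subset conceptors yields a single conceptor that covers the directions captured by either subset.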

Complexity-Weighted Loss and Diverse Reranking for Sentence Simplification
Reno Kriz | João Sedoc | Marianna Apidianaki | Carolina Zheng | Gaurav Kumar | Eleni Miltsakaki | Chris Callison-Burch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Sentence simplification is the task of rewriting texts so they are easier to understand. Recent research has applied sequence-to-sequence (Seq2Seq) models to this task, focusing largely on training-time improvements via reinforcement learning and memory augmentation. One of the main problems with applying generic Seq2Seq models for simplification is that these models tend to copy directly from the original sentence, resulting in outputs that are relatively long and complex. We aim to alleviate this issue through the use of two main techniques. First, we incorporate content word complexities, as predicted with a leveled word complexity model, into our loss function during training. Second, we generate a large set of diverse candidate simplifications at test time, and rerank these to promote fluency, adequacy, and simplicity. Here, we measure simplicity through a novel sentence complexity model. These extensions allow our models to perform competitively with state-of-the-art systems while generating simpler sentences. We report standard automatic and human evaluation metrics.

Continual Learning for Sentence Representations Using Conceptors
Tianlin Liu | Lyle Ungar | João Sedoc
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Distributed representations of sentences have become ubiquitous in natural language processing tasks. In this paper, we consider a continual learning scenario for sentence representations: Given a sequence of corpora, we aim to optimize the sentence encoder with respect to the new corpus while maintaining its accuracy on the old corpora. To address this problem, we propose to initialize sentence encoders with the help of corpus-independent features, and then sequentially update sentence encoders using Boolean operations of conceptor matrices to learn corpus-dependent features. We evaluate our approach on semantic textual similarity tasks and show that our proposed sentence encoder can continually learn features from new corpora while retaining its competence on previously encountered corpora.

ChatEval: A Tool for Chatbot Evaluation
João Sedoc | Daphne Ippolito | Arun Kirubarajan | Jai Thirani | Lyle Ungar | Chris Callison-Burch
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Open-domain dialog systems (i.e., chatbots) are difficult to evaluate. The current best practice for analyzing and comparing these dialog systems is the use of human judgments. However, the lack of standardization in evaluation procedures and the fact that model parameters and code are rarely published hinder systematic human evaluation experiments. We introduce a unified framework for human evaluation of chatbots that augments existing tools and provides a web-based hub for researchers to share and compare their dialog systems. Researchers can submit their trained models to the ChatEval web interface and obtain comparisons with baselines and prior work. The evaluation code is open-source to ensure standardization and transparency. In addition, we introduce open-source baseline models and evaluation datasets. ChatEval can be found at https://chateval.org.

2018

Modeling Empathy and Distress in Reaction to News Stories
Sven Buechel | Anneke Buffone | Barry Slaff | Lyle Ungar | João Sedoc
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Computational detection and understanding of empathy is an important factor in advancing human-computer interaction. Yet to date, text-based empathy prediction has the following major limitations: it underestimates the psychological complexity of the phenomenon, adheres to a weak notion of ground truth where empathic states are ascribed by third parties, and lacks a shared corpus. In contrast, this contribution presents the first publicly available gold standard for empathy prediction. It is constructed using a novel annotation methodology which reliably captures empathy assessments by the writer of a statement using multi-item scales. This is also the first computational work distinguishing between multiple forms of empathy, namely empathic concern and personal distress, as recognized throughout psychology. Finally, we present experimental results for three different predictive models, of which a CNN performs the best.

ChatEval: A Tool for the Systematic Evaluation of Chatbots
João Sedoc | Daphne Ippolito | Arun Kirubarajan | Jai Thirani | Lyle Ungar | Chris Callison-Burch
Proceedings of the Workshop on Intelligent Interactive Systems and Language Generation (2IS&NLG)

2017

Semantic Word Clusters Using Signed Spectral Clustering
João Sedoc | Jean Gallier | Dean Foster | Lyle Ungar
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Vector space representations of words capture many aspects of word similarity, but such methods tend to produce vector spaces in which antonyms (as well as synonyms) are close to each other. For spectral clustering using such word embeddings, words are points in a vector space where synonyms are linked with positive weights, while antonyms are linked with negative weights. We present a new signed spectral normalized graph cut algorithm, signed clustering, that overlays existing thesauri upon distributionally derived vector representations of words, so that antonym relationships between word pairs are represented by negative weights. Our signed clustering algorithm produces clusters of words that simultaneously capture distributional and synonym relations. By using randomized spectral decomposition (Halko et al., 2011) and sparse matrices, our method is both fast and scalable. We validate our clusters using datasets containing human judgments of word pair similarities and show the benefit of using our word clusters for sentiment prediction.
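
To make the role of negative weights concrete, here is a small sketch of the unnormalized signed graph Laplacian, in which node degrees are sums of absolute edge weights. It is an illustration only: the paper's algorithm is a normalized cut with additional machinery, and the three-word graph, its weights, and the function names are hypothetical.

```python
def signed_laplacian(W):
    """Return L = D_bar - W, where D_bar[i][i] = sum_j |W[i][j]|."""
    n = len(W)
    L = [[-W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        L[i][i] += sum(abs(w) for w in W[i])
    return L

def quad_form(L, x):
    """Compute x^T L x, the cut-style cost of assignment x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))

# Toy graph over (good, great, bad): synonyms +1, antonyms -1.
W = [
    [0, 1, -1],
    [1, 0, -1],
    [-1, -1, 0],
]
L = signed_laplacian(W)

print(quad_form(L, [1, 1, -1]))  # antonym placed on the opposite side
print(quad_form(L, [1, 1, 1]))   # all three words grouped together
```

The cost is zero when synonyms share a sign and antonyms take opposite signs, and it grows when antonyms are grouped together, which is the behavior a signed clustering objective relies on.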

Predicting Emotional Word Ratings using Distributional Representations and Signed Clustering
João Sedoc | Daniel Preoţiuc-Pietro | Lyle Ungar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Inferring the emotional content of words is important for text-based sentiment analysis, dialogue systems and psycholinguistics, but word ratings are expensive to collect at scale and across languages or domains. We develop a method that automatically extends word-level ratings to unrated words using signed clustering of vector space word representations along with affect ratings. We use our method to determine a word’s valence and arousal, which determine its position on the circumplex model of affect, the most popular dimensional model of emotion. Our method achieves superior out-of-sample word rating prediction on both affective dimensions across three different languages when compared to state-of-the-art word similarity based methods. Our method can assist building word ratings for new languages and improve downstream tasks such as sentiment analysis and emotion detection.