Alex Wang


pdf bib
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
Alex Wang | Kyunghyun Cho | Mike Lewis
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Practical applications of abstractive summarization models are limited by frequent factual inconsistencies with respect to their input. Existing automatic evaluation metrics for summarization are largely insensitive to such errors. We propose QAGS (pronounced “kags”), an automatic evaluation protocol that is designed to identify factual inconsistencies in a generated summary. QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source. To evaluate QAGS, we collect human judgments of factual consistency on model-generated summaries for the CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018) summarization datasets. QAGS has substantially higher correlations with these judgments than other automatic evaluation metrics. Also, QAGS offers a natural form of interpretability: The answers and questions generated while computing QAGS indicate which tokens of a summary are inconsistent and why. We believe QAGS is a promising tool in automatically generating usable and factually consistent text. Code for QAGS will be available at

pdf bib
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
Yada Pruksachatkun | Phil Yeres | Haokun Liu | Jason Phang | Phu Mon Htut | Alex Wang | Ian Tenney | Samuel R. Bowman
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks. jiant enables modular and configuration driven experimentation with state-of-the-art models and a broad set of tasks for probing, transfer learning, and multitask training experiments. jiant implements over 50 NLU tasks, including all GLUE and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published performance on a variety of tasks and models, e.g., RoBERTa and BERT.

pdf bib
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
Nafise Sadat Moosavi | Angela Fan | Vered Shwartz | Goran Glavaš | Shafiq Joty | Alex Wang | Thomas Wolf
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

pdf bib
Overview of the SustaiNLP 2020 Shared Task
Alex Wang | Thomas Wolf
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing

We describe the SustaiNLP 2020 shared task: efficient inference on the SuperGLUE benchmark (Wang et al., 2019). Participants are evaluated based on performance on the benchmark as well as energy consumed in making predictions on the test sets. We describe the task, its organization, and the submitted systems. Across the six submissions to the shared task, participants achieved efficiency gains of 20× over a standard BERT (Devlin et al., 2019) baseline, while losing less than an absolute point in performance.


pdf bib
Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling
Alex Wang | Jan Hula | Patrick Xia | Raghavendra Pappagari | R. Thomas McCoy | Roma Patel | Najoung Kim | Ian Tenney | Yinghui Huang | Katherin Yu | Shuning Jin | Berlin Chen | Benjamin Van Durme | Edouard Grave | Ellie Pavlick | Samuel R. Bowman
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo’s pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.

pdf bib
BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model
Alex Wang | Kyunghyun Cho
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation

We show that BERT (Devlin et al., 2018) is a Markov random field language model. This formulation gives way to a natural procedure to sample sentences from BERT. We generate from BERT and find that it can produce high quality, fluent generations. Compared to the generations of a traditional left-to-right language model, BERT generates sentences that are more diverse but of slightly worse quality.

pdf bib
On Measuring Social Biases in Sentence Encoders
Chandler May | Alex Wang | Shikha Bordia | Samuel R. Bowman | Rachel Rudinger
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The Word Embedding Association Test shows that GloVe and word2vec word embeddings exhibit human-like implicit biases based on gender, race, and other social constructs (Caliskan et al., 2017). Meanwhile, research on learning reusable text representations has begun to explore sentence-level texts, with some sentence encoders seeing enthusiastic adoption. Accordingly, we extend the Word Embedding Association Test to measure bias in sentence encoders. We then test several sentence encoders, including state-of-the-art methods such as ELMo and BERT, for the social biases studied in prior work and two important biases that are difficult or impossible to test at the word level. We observe mixed results including suspicious patterns of sensitivity that suggest the test’s assumptions may not hold in general. We conclude by proposing directions for future work on measuring bias in sentence encoders.

pdf bib
Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
Najoung Kim | Roma Patel | Adam Poliak | Patrick Xia | Alex Wang | Tom McCoy | Ian Tenney | Alexis Ross | Tal Linzen | Benjamin Van Durme | Samuel R. Bowman | Ellie Pavlick
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on CCG—our most syntactic objective—performs the best on average across our probing tasks, suggesting that syntactic knowledge helps function word comprehension. Language modeling also shows strong performance, supporting its widespread use for pretraining state-of-the-art NLP models. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.


pdf bib
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang | Amanpreet Singh | Julian Michael | Felix Hill | Omer Levy | Samuel Bowman
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Human ability to understand language is general, flexible, and robust. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.


pdf bib
Learning Linguistic Descriptors of User Roles in Online Communities
Alex Wang | William L. Hamilton | Jure Leskovec
Proceedings of the First Workshop on NLP and Computational Social Science