Alex Warstadt


2020

Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition
Paloma Jeretic | Alex Warstadt | Suvrat Bhooshan | Adina Williams
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether a sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of 32K semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI appears to contain very few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences. It reliably treats scalar implicatures triggered by “some” as entailments. For some presupposition triggers like “only”, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment-canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences.

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
Alex Warstadt | Yian Zhang | Xiaocheng Li | Haokun Liu | Samuel R. Bowman
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa models from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to that of the publicly available RoBERTa_BASE. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa_BASE does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.

BLiMP: A Benchmark of Linguistic Minimal Pairs for English
Alex Warstadt | Alicia Parrish | Haokun Liu | Anhad Mohananey | Wei Peng | Sheng-Fu Wang | Samuel R. Bowman
Proceedings of the Society for Computation in Linguistics 2020

BLiMP: The Benchmark of Linguistic Minimal Pairs for English
Alex Warstadt | Alicia Parrish | Haokun Liu | Anhad Mohananey | Wei Peng | Sheng-Fu Wang | Samuel R. Bowman
Transactions of the Association for Computational Linguistics, Volume 8

We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs, that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomena in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.

2019

Neural Network Acceptability Judgments
Alex Warstadt | Amanpreet Singh | Samuel R. Bowman
Transactions of the Association for Computational Linguistics, Volume 7

This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error analysis on specific grammatical phenomena reveals that both Lau et al.’s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.

Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs
Alex Warstadt | Yu Cao | Ioana Grosu | Wei Peng | Hagen Blix | Yining Nie | Anna Alsop | Shikha Bordia | Haokun Liu | Alicia Parrish | Sheng-Fu Wang | Jason Phang | Anhad Mohananey | Phu Mon Htut | Paloma Jeretic | Samuel R. Bowman
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Though state-of-the-art sentence representation models can perform tasks requiring significant knowledge of grammar, it is an open question how best to evaluate their grammatical knowledge. We explore five experimental methods inspired by prior work evaluating pretrained sentence representation models. We use a single linguistic phenomenon, negative polarity item (NPI) licensing, as a case study for our experiments. NPIs like any are grammatical only if they appear in a licensing environment like negation (Sue doesn’t have any cats vs. *Sue has any cats). This phenomenon is challenging because of the variety of NPI licensing environments that exist. We introduce an artificially generated dataset that manipulates key features of NPI licensing for the experiments. We find that BERT has significant knowledge of these features, but its success varies widely across different experimental methods. We conclude that a variety of methods is necessary to reveal all relevant aspects of a model’s grammatical knowledge in a given domain.

Verb Argument Structure Alternations in Word and Sentence Embeddings
Katharina Kann | Alex Warstadt | Adina Williams | Samuel R. Bowman
Proceedings of the Society for Computation in Linguistics (SCiL) 2019