Haokun Liu


pdf bib
Intermediate-Task Transfer Learning with Pretrained Language Models: When and Why Does It Work?
Yada Pruksachatkun | Jason Phang | Haokun Liu | Phu Mon Htut | Xiaoyi Zhang | Richard Yuanzhe Pang | Clara Vania | Katharina Kann | Samuel R. Bowman
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.

pdf bib
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
Yada Pruksachatkun | Phil Yeres | Haokun Liu | Jason Phang | Phu Mon Htut | Alex Wang | Ian Tenney | Samuel R. Bowman
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks. jiant enables modular and configuration driven experimentation with state-of-the-art models and a broad set of tasks for probing, transfer learning, and multitask training experiments. jiant implements over 50 NLU tasks, including all GLUE and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published performance on a variety of tasks and models, e.g., RoBERTa and BERT.

pdf bib
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
William Huang | Haokun Liu | Samuel R. Bowman
Proceedings of the First Workshop on Insights from Negative Results in NLP

A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks—datasets collected from crowdworkers to create an evaluation task—while still failing on out-of-domain examples for the same task. Recent work has explored the use of counterfactually-augmented data—data built by minimally editing a set of seed examples to yield counterfactual labels—to augment training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit on reading comprehension tasks when controlling for dataset size and cost of collection. We build upon this work by using English natural language inference data to test model generalization and robustness and find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than unaugmented datasets of similar size and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data and further innovation is required to make this general line of work viable.

pdf bib
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)
Alex Warstadt | Yian Zhang | Xiaocheng Li | Haokun Liu | Samuel R. Bowman
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-turning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during finetuning. We pretrain RoBERTa from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa_BASE. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa_BASE does consistently demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.

pdf bib
Precise Task Formalization Matters in Winograd Schema Evaluations
Haokun Liu | William Huang | Dhara Mungra | Samuel R. Bowman
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization—the combination of input specification, loss function, and reuse of pretrained parameters—by users of the dataset, rather than improvements in the pretrained model’s reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance dramatically and (ii)several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model’s extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.

pdf bib
BLiMP: A Benchmark of Linguistic Minimal Pairs for English
Alex Warstadt | Alicia Parrish | Haokun Liu | Anhad Mohananey | Wei Peng | Sheng-Fu Wang | Samuel R. Bowman
Proceedings of the Society for Computation in Linguistics 2020

pdf bib
BLiMP: The Benchmark of Linguistic Minimal Pairs for English
Alex Warstadt | Alicia Parrish | Haokun Liu | Anhad Mohananey | Wei Peng | Sheng-Fu Wang | Samuel R. Bowman
Transactions of the Association for Computational Linguistics, Volume 8

We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP),1 a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs—that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.


pdf bib
Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs
Alex Warstadt | Yu Cao | Ioana Grosu | Wei Peng | Hagen Blix | Yining Nie | Anna Alsop | Shikha Bordia | Haokun Liu | Alicia Parrish | Sheng-Fu Wang | Jason Phang | Anhad Mohananey | Phu Mon Htut | Paloma Jeretic | Samuel R. Bowman
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Though state-of-the-art sentence representation models can perform tasks requiring significant knowledge of grammar, it is an open question how best to evaluate their grammatical knowledge. We explore five experimental methods inspired by prior work evaluating pretrained sentence representation models. We use a single linguistic phenomenon, negative polarity item (NPI) licensing, as a case study for our experiments. NPIs like any are grammatical only if they appear in a licensing environment like negation (Sue doesn’t have any cats vs. *Sue has any cats). This phenomenon is challenging because of the variety of NPI licensing environments that exist. We introduce an artificially generated dataset that manipulates key features of NPI licensing for the experiments. We find that BERT has significant knowledge of these features, but its success varies widely across different experimental methods. We conclude that a variety of methods is necessary to reveal all relevant aspects of a model’s grammatical knowledge in a given domain.


pdf bib
MEMD: A Diversity-Promoting Learning Framework for Short-Text Conversation
Meng Zou | Xihan Li | Haokun Liu | Zhihong Deng
Proceedings of the 27th International Conference on Computational Linguistics

Neural encoder-decoder models have been widely applied to conversational response generation, which is a research hot spot in recent years. However, conventional neural encoder-decoder models tend to generate commonplace responses like “I don’t know” regardless of what the input is. In this paper, we analyze this problem from a new perspective: latent vectors. Based on it, we propose an easy-to-extend learning framework named MEMD (Multi-Encoder to Multi-Decoder), in which an auxiliary encoder and an auxiliary decoder are introduced to provide necessary training guidance without resorting to extra data or complicating network’s inner structure. Experimental results demonstrate that our method effectively improve the quality of generated responses according to automatic metrics and human evaluations, yielding more diverse and smooth replies.