Kaheer Suleman


2020

An Analysis of Dataset Overlap on Winograd-Style Tasks
Ali Emami | Kaheer Suleman | Adam Trischler | Jackie Chi Kit Cheung
Proceedings of the 28th International Conference on Computational Linguistics

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, which consists of over 60k pronoun disambiguation problems scraped from web data; it is the largest corpus to date and has a significantly lower proportion of overlaps with current pretraining corpora.
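
As a rough illustration of what measuring overlap between a test instance and a pretraining corpus could look like, the sketch below scores an instance by the fraction of its word n-grams that also occur in a corpus. This is an assumption made for illustration, not the measure used in the paper: the function names, the whitespace tokenization, and the default n = 3 are all hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(instance, corpus_docs, n=3):
    """Fraction of the instance's n-grams that appear anywhere in the corpus.

    Hypothetical helper: lowercases and splits on whitespace, which is far
    cruder than any tokenization a real analysis would use.
    """
    instance_grams = ngrams(instance.lower().split(), n)
    if not instance_grams:
        return 0.0  # instance shorter than n tokens: nothing to compare
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc.lower().split(), n)
    return len(instance_grams & corpus_grams) / len(instance_grams)
```

Under this toy definition, instances scoring near 0 would be the "minimal overlap" ones; a real analysis would also need to scale to corpora far too large to hold as an in-memory set.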

On the Systematicity of Probing Contextualized Word Representations: The Case of Hypernymy in BERT
Abhilasha Ravichander | Eduard Hovy | Kaheer Suleman | Adam Trischler | Jackie Chi Kit Cheung
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Contextualized word representations have become a driving force in NLP, motivating widespread interest in understanding their capabilities and the mechanisms by which they operate. Particularly intriguing is their ability to identify and encode conceptual abstractions. Past work has probed BERT representations for this competence, finding that BERT can correctly retrieve noun hypernyms in cloze tasks. In this work, we ask the question: do probing studies shed light on systematic knowledge in BERT representations? As a case study, we examine hypernymy knowledge encoded in BERT representations. In particular, we demonstrate through a simple consistency probe that the ability to correctly retrieve hypernyms in cloze tasks, as used in prior work, does not correspond to systematic knowledge in BERT. Our main conclusion is cautionary: even if BERT demonstrates high probing accuracy for a particular competence, it does not necessarily follow that BERT ‘understands’ a concept, and it cannot be expected to systematically generalize across applicable contexts.

2019

How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG
Paul Trichelair | Ali Emami | Adam Trischler | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recent studies have significantly improved the state of the art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks including size limitations, structural regularities, and variable instance difficulty.

Can a Gorilla Ride a Camel? Learning Semantic Plausibility from Text
Ian Porada | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Modeling semantic plausibility requires commonsense knowledge about the world and has been used as a testbed for exploring various knowledge representations. Previous work has focused specifically on modeling physical plausibility and shown that distributional methods fail when tested in a supervised setting. At the same time, distributional models, namely large pretrained language models, have led to improved results for many natural language understanding tasks. In this work, we show that these pretrained language models are in fact effective at modeling physical plausibility in the supervised setting. We therefore present the more difficult problem of learning to model physical plausibility directly from text. We create a training set by extracting attested events from a large corpus, and we provide a baseline for training on these attested events in a self-supervised manner and testing on a physical plausibility task. We believe results could be further improved by injecting explicit commonsense knowledge into a distributional model.

The KnowRef Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution
Ali Emami | Paul Trichelair | Adam Trischler | Kaheer Suleman | Hannes Schulz | Jackie Chi Kit Cheung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce a new benchmark for coreference resolution and NLI, KnowRef, that targets common-sense understanding and world knowledge. Previous coreference resolution tasks can largely be solved by exploiting the number and gender of the antecedents, or have been handcrafted and do not reflect the diversity of naturally occurring text. We present a corpus of over 8,000 annotated text passages with ambiguous pronominal anaphora. These instances are both challenging and realistic. We show that various coreference systems, whether rule-based, feature-rich, or neural, perform significantly worse on the task than humans, who display high inter-annotator agreement. To explain this performance gap, we show empirically that state-of-the-art models often fail to capture context, instead relying on the gender or number of candidate antecedents to make a decision. We then use problem-specific insights to propose a data-augmentation trick called antecedent switching to alleviate this tendency in models. Finally, we show that antecedent switching yields promising results on other tasks as well: we use it to achieve state-of-the-art results on the GAP coreference task.
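
The abstract names antecedent switching without spelling it out. One plausible reading, sketched below as an assumption rather than as the authors' implementation, is to swap the two candidate antecedents wherever they occur in the passage and to flip the gold label with them, so that a model cannot succeed by memorizing which name tends to be the answer. The function name and the string-based representation of an instance are hypothetical.

```python
import re


def switch_antecedents(text, cand_a, cand_b, label):
    """Swap two candidate antecedents in `text` and flip the gold label.

    Hypothetical helper: `label` is the candidate (cand_a or cand_b) that
    the ambiguous pronoun refers to in the original text.
    """
    pattern = re.compile(
        r"\b(" + re.escape(cand_a) + "|" + re.escape(cand_b) + r")\b"
    )
    # Replace every occurrence of one candidate with the other in one pass.
    swapped = pattern.sub(
        lambda m: cand_b if m.group(1) == cand_a else cand_a, text
    )
    # The referent keeps its position in the sentence, so its name changes.
    new_label = cand_b if label == cand_a else cand_a
    return swapped, new_label
```

For example, applying it to "Kevin thanked Jim because he helped." with gold label "Jim" produces a second training instance in which the names trade places and the gold label becomes "Kevin".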

2018

A Generalized Knowledge Hunting Framework for the Winograd Schema Challenge
Ali Emami | Adam Trischler | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

We introduce an automatic system that performs well on two common-sense reasoning tasks, the Winograd Schema Challenge (WSC) and the Choice of Plausible Alternatives (COPA). Problem instances from these tasks require diverse, complex forms of inference and knowledge to solve. Our method uses a knowledge-hunting module to gather text from the web, which serves as evidence for candidate problem resolutions. Given an input problem, our system generates relevant queries to send to a search engine. It extracts and classifies knowledge from the returned results and weighs it to make a resolution. Our approach improves F1 performance on the WSC by 0.16 over the previous best and is competitive with the state of the art on COPA, demonstrating its general applicability.

A Knowledge Hunting Framework for Common Sense Reasoning
Ali Emami | Noelia De La Cruz | Adam Trischler | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce an automatic system that achieves state-of-the-art results on the Winograd Schema Challenge (WSC), a common sense reasoning task that requires diverse, complex forms of inference and knowledge. Our method uses a knowledge hunting module to gather text from the web, which serves as evidence for candidate problem resolutions. Given an input problem, our system generates relevant queries to send to a search engine, then extracts and classifies knowledge from the returned results and weighs it to make a resolution. Our approach improves F1 performance on the full WSC by 0.21 over the previous best and represents the first system to exceed 0.5 F1. We further demonstrate that the approach is competitive on the Choice of Plausible Alternatives (COPA) task, which suggests that it is generally applicable.

2017

NewsQA: A Machine Comprehension Dataset
Adam Trischler | Tong Wang | Xingdi Yuan | Justin Harris | Alessandro Sordoni | Philip Bachman | Kaheer Suleman
Proceedings of the 2nd Workshop on Representation Learning for NLP

We present NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text in the articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. Analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (13.3% F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available online.

Frames: a corpus for adding memory to goal-oriented dialogue systems
Layla El Asri | Hannes Schulz | Shikhar Sharma | Jeremie Zumer | Justin Harris | Emery Fine | Rahul Mehrotra | Kaheer Suleman
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

This paper proposes a new dataset, Frames, composed of 1369 human-human dialogues with an average of 15 turns per dialogue. This corpus contains goal-oriented dialogues between users who are given some constraints to book a trip and assistants who search a database to find appropriate trips. The users exhibit complex decision-making behaviour that involves comparing trips, exploring different options, and selecting among the trips that were discussed during the dialogue. To drive research on dialogue systems towards handling such behaviour, we have annotated and released the dataset, and we propose in this paper a task called frame tracking. This task consists of keeping track of different semantic frames throughout each dialogue. We propose a rule-based baseline and analyse the frame tracking task through this baseline.

2016

Natural Language Comprehension with the EpiReader
Adam Trischler | Zheng Ye | Xingdi Yuan | Philip Bachman | Alessandro Sordoni | Kaheer Suleman
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Policy Networks with Two-Stage Training for Dialogue Systems
Mehdi Fatemi | Layla El Asri | Hannes Schulz | Jing He | Kaheer Suleman
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue