Ali Emami


2020

pdf bib
An Analysis of Dataset Overlap on Winograd-Style Tasks
Ali Emami | Kaheer Suleman | Adam Trischler | Jackie Chi Kit Cheung
Proceedings of the 28th International Conference on Computational Linguistics

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlaps that occur between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, consisting of over 60k pronoun disambiguation problems scraped from web data, being both the largest corpus to date, and having a significantly lower proportion of overlaps with current pretraining corpora.

2019

pdf bib
How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG
Paul Trichelair | Ali Emami | Adam Trischler | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks including size limitations, structural regularities, and variable instance difficulty.

pdf bib
The KnowRef Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution
Ali Emami | Paul Trichelair | Adam Trischler | Kaheer Suleman | Hannes Schulz | Jackie Chi Kit Cheung
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce a new benchmark for coreference resolution and NLI, KnowRef, that targets common-sense understanding and world knowledge. Previous coreference resolution tasks can largely be solved by exploiting the number and gender of the antecedents, or have been handcrafted and do not reflect the diversity of naturally occurring text. We present a corpus of over 8,000 annotated text passages with ambiguous pronominal anaphora. These instances are both challenging and realistic. We show that various coreference systems, whether rule-based, feature-rich, or neural, perform significantly worse on the task than humans, who display high inter-annotator agreement. To explain this performance gap, we show empirically that state-of-the art models often fail to capture context, instead relying on the gender or number of candidate antecedents to make a decision. We then use problem-specific insights to propose a data-augmentation trick called antecedent switching to alleviate this tendency in models. Finally, we show that antecedent switching yields promising results on other tasks as well: we use it to achieve state-of-the-art results on the GAP coreference task.

2018

pdf bib
A Generalized Knowledge Hunting Framework for the Winograd Schema Challenge
Ali Emami | Adam Trischler | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

We introduce an automatic system that performs well on two common-sense reasoning tasks, the Winograd Schema Challenge (WSC) and the Choice of Plausible Alternatives (COPA). Problem instances from these tasks require diverse, complex forms of inference and knowledge to solve. Our method uses a knowledge-hunting module to gather text from the web, which serves as evidence for candidate problem resolutions. Given an input problem, our system generates relevant queries to send to a search engine. It extracts and classifies knowledge from the returned results and weighs it to make a resolution. Our approach improves F1 performance on the WSC by 0.16 over the previous best and is competitive with the state-of-the-art on COPA, demonstrating its general applicability.

pdf bib
A Knowledge Hunting Framework for Common Sense Reasoning
Ali Emami | Noelia De La Cruz | Adam Trischler | Kaheer Suleman | Jackie Chi Kit Cheung
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce an automatic system that achieves state-of-the-art results on the Winograd Schema Challenge (WSC), a common sense reasoning task that requires diverse, complex forms of inference and knowledge. Our method uses a knowledge hunting module to gather text from the web, which serves as evidence for candidate problem resolutions. Given an input problem, our system generates relevant queries to send to a search engine, then extracts and classifies knowledge from the returned results and weighs them to make a resolution. Our approach improves F1 performance on the full WSC by 0.21 over the previous best and represents the first system to exceed 0.5 F1. We further demonstrate that the approach is competitive on the Choice of Plausible Alternatives (COPA) task, which suggests that it is generally applicable.