Michael Boratko


2020

pdf bib
ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
Michael Boratko | Xiang Li | Tim O’Gorman | Rajarshi Das | Dan Le | Andrew McCallum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Given questions regarding some prototypical situation — such as Name something that people usually do before they leave the house for work? — a human can easily answer them via acquired experiences. There can be multiple right answers for such questions, with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international trivia game show – Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers. We also propose a generative evaluation task where a model has to output a ranked list of answers, ideally covering all prototypical answers for a question. After presenting multiple competitive baseline models, we find that human performance still exceeds model scores on all evaluation metrics with a meaningful gap, supporting the challenging nature of the task.

2018

pdf bib
An Interface for Annotating Science Questions
Michael Boratko | Harshit Padigela | Divyendra Mikkilineni | Pritish Yuvraj | Rajarshi Das | Andrew McCallum | Maria Chang | Achille Fokoue | Pavan Kapanipathi | Nicholas Mattei | Ryan Musa | Kartik Talamadupula | Michael Witbrock
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Recent work introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into an Easy Set and a Challenge Set. That work includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them. However, it does not include clear definitions of these types, nor does it offer information about the quality of the labels or the annotation process used. In this paper, we introduce a novel interface for human annotation of science question-answer pairs with their respective knowledge and reasoning types, in order that the classification of new questions may be improved. We build on the classification schema proposed by prior work on the ARC dataset, and evaluate the effectiveness of our interface with a preliminary study involving 10 participants.

pdf bib
A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset
Michael Boratko | Harshit Padigela | Divyendra Mikkilineni | Pritish Yuvraj | Rajarshi Das | Andrew McCallum | Maria Chang | Achille Fokoue-Nkoutche | Pavan Kapanipathi | Nicholas Mattei | Ryan Musa | Kartik Talamadupula | Michael Witbrock
Proceedings of the Workshop on Machine Reading for Question Answering

The recent work of Clark et al. (2018) introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into easy and challenge sets. That paper includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them; however, it does not include clear definitions of these types, nor does it offer information about the quality of the labels. We propose a comprehensive set of definitions of knowledge and reasoning types necessary for answering the questions in the ARC dataset. Using ten annotators and a sophisticated annotation interface, we analyze the distribution of labels across the challenge set and statistics related to them. Additionally, we demonstrate that although naive information retrieval methods return sentences that are irrelevant to answering the query, sufficient supporting text is often present in the (ARC) corpus. Evaluating with human-selected relevant sentences improves the performance of a neural machine comprehension model by 42 points.