Natural language inference (NLI) data has proven useful in benchmarking and, especially, as pretraining data for tasks requiring language understanding. However, the crowdsourcing protocol that was used to collect this data has known issues and was not explicitly optimized for either of these purposes, so it is likely far from ideal. We propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a fifth baseline protocol, we collect and compare five new 8.5k-example training sets. In evaluations focused on transfer learning applications, our results are solidly negative, with models trained on our baseline dataset yielding good transfer performance to downstream tasks, but none of our four new methods (nor the recent ANLI) showing any improvements over that baseline. In a small silver lining, we observe that all four new protocols, especially those where annotators edit *pre-filled* text boxes, reduce previously observed issues with annotation artifacts.
Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.
We present the Natural Questions corpus, a question answering data set. Questions consist of real anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotated sequestered as test data. We present experiments validating quality of the data. We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.