A Repository of Conversational Datasets

Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, Tsung-Hsien Wen


Abstract
Progress in Machine Learning is often driven by the availability of large datasets, and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation procedure for conversational response selection models using 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering steps to their needs. We introduce and evaluate several competitive baselines for conversational response selection, whose implementations are shared in the repository, as well as a neural encoder model that is trained on the entire training set.
Anthology ID:
W19-4101
Volume:
Proceedings of the First Workshop on NLP for Conversational AI
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1–10
Language:
URL:
https://www.aclweb.org/anthology/W19-4101
DOI:
10.18653/v1/W19-4101
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W19-4101.pdf