Using exemplar responses for training and evaluating automated speech scoring systems

Anastassia Loukina, Klaus Zechner, James Bruno, Beata Beigman Klebanov


Abstract
Automated scoring engines are usually trained and evaluated against human scores and compared to the benchmark of human-human agreement. In this paper, we compare the performance of an automated speech scoring engine using two corpora: a corpus of almost 700,000 randomly sampled spoken responses with scores assigned by one or two raters during operational scoring, and a corpus of 16,500 exemplar responses with scores reviewed by multiple expert raters. We show that the choice of corpus used for model evaluation has a major effect on estimates of system performance, with Pearson's r varying between 0.64 and 0.80. Surprisingly, this is not the case for the choice of corpus for model training: when the training corpus is sufficiently large, systems trained on different corpora showed almost identical performance when evaluated on the same corpus. We show that this effect is consistent across several learning algorithms. We conclude that evaluating the model on a corpus of exemplar responses, if one is available, provides additional evidence of system validity; at the same time, investing effort in creating a corpus of exemplar responses for model training is unlikely to lead to a substantial gain in model performance.
Anthology ID:
W18-0501
Volume:
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications
Month:
June
Year:
2018
Address:
New Orleans, Louisiana
Venues:
BEA | NAACL | WS
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://www.aclweb.org/anthology/W18-0501
DOI:
10.18653/v1/W18-0501
PDF:
http://aclanthology.lst.uni-saarland.de/W18-0501.pdf