Using PRMSE to evaluate automated scoring systems in the presence of label noise

Anastassia Loukina, Nitin Madnani, Aoife Cahill, Lili Yao, Matthew S. Johnson, Brian Riordan, Daniel F. McCaffrey


Abstract
The effect of noisy labels on the performance of NLP systems has been studied extensively for system training. In this paper, we focus on the effect that noisy labels have on system evaluation. Using automated scoring as an example, we demonstrate that the quality of human ratings used for system evaluation have a substantial impact on traditional performance metrics, making it impossible to compare system evaluations on labels with different quality. We propose that a new metric, PRMSE, developed within the educational measurement community, can help address this issue, and provide practical guidelines on using PRMSE.
Anthology ID:
2020.bea-1.2
Volume:
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Month:
July
Year:
2020
Address:
Seattle, WA, USA → Online
Venues:
ACL | BEA | WS
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Note:
Pages:
18–29
Language:
URL:
https://www.aclweb.org/anthology/2020.bea-1.2
DOI:
10.18653/v1/2020.bea-1.2
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/2020.bea-1.2.pdf