Using PRMSE to evaluate automated scoring systems in the presence of label noise
Anastassia Loukina | Nitin Madnani | Aoife Cahill | Lili Yao | Matthew S. Johnson | Brian Riordan | Daniel F. McCaffrey
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

The effect of noisy labels on the performance of NLP systems has been studied extensively for system training. In this paper, we focus on the effect that noisy labels have on system evaluation. Using automated scoring as an example, we demonstrate that the quality of human ratings used for system evaluation has a substantial impact on traditional performance metrics, making it impossible to compare system evaluations on labels of different quality. We propose that a new metric, PRMSE, developed within the educational measurement community, can help address this issue, and we provide practical guidelines on using PRMSE.


Towards Implicit Content-Introducing for Generative Short-Text Conversation Systems
Lili Yao | Yaoyuan Zhang | Yansong Feng | Dongyan Zhao | Rui Yan
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Human-computer conversation systems are an active research topic. One of the prevailing methods for building such systems is the generative Sequence-to-Sequence (Seq2Seq) model implemented with neural networks. However, the standard Seq2Seq model is prone to generating trivial responses. In this paper, we aim to generate a more meaningful and informative reply when answering a given question. We propose an implicit content-introducing method which incorporates additional information into the Seq2Seq model in a flexible way. Specifically, we fuse the general decoding and the auxiliary cue word information through our proposed hierarchical gated fusion unit. Experiments on real-life data demonstrate that our model consistently outperforms a set of competitive baselines in terms of BLEU scores and human evaluation.