Red-faced ROUGE: Examining the Suitability of ROUGE for Opinion Summary Evaluation

Wenyi Tay, Aditya Joshi, Xiuzhen Zhang, Sarvnaz Karimi, Stephen Wan


Abstract
One of the most common metrics for automatically evaluating opinion summaries is ROUGE, a metric developed for text summarisation. ROUGE counts the overlap of words or word units between a candidate summary and a set of reference summaries. This formulation treats all words in the reference summary equally. In opinion summaries, however, not all words in the reference are equally important. Opinion summarisation requires correctly pairing two types of semantic information between candidate and reference summaries: (1) the aspect, or opinion target; and (2) the polarity. We investigate the suitability of ROUGE for evaluating opinion summaries of online reviews. Using three simulation-based experiments, we evaluate the behaviour of ROUGE for opinion summarisation in terms of its ability to match aspect and polarity. We show that ROUGE cannot distinguish between opinion summaries of similar or opposite polarities for the same aspect. Moreover, ROUGE scores show significant variance under different configuration settings. Consequently, we present three recommendations for future work that uses ROUGE to evaluate opinion summarisation.
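To make the abstract's point concrete, the following is a minimal sketch of ROUGE-N recall (not the paper's code, and simplified relative to the official ROUGE toolkit, e.g. no stemming or stopword handling). Because every reference word counts equally, a candidate that flips the polarity word on the same aspect loses only a single unigram match; the example sentences below are hypothetical.

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams that appear in the candidate."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram overlap count
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Two candidates with opposite polarity on the same aspect ("battery life"):
ref = "the battery life is great"
print(rouge_n_recall("battery life is great", ref))     # 0.8 (4 of 5 unigrams)
print(rouge_n_recall("battery life is terrible", ref))  # 0.6 (polarity flip costs one word)
```

The small gap between 0.8 and 0.6 illustrates why unigram overlap alone cannot reliably separate a correct summary from one asserting the opposite opinion.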
Anthology ID:
U19-1008
Volume:
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association
Month:
4–6 December
Year:
2019
Address:
Sydney, Australia
Venue:
ALTA
Publisher:
Australasian Language Technology Association
Pages:
52–60
URL:
https://www.aclweb.org/anthology/U19-1008
PDF:
http://aclanthology.lst.uni-saarland.de/U19-1008.pdf