Revisiting the Centroid-based Method: A Strong Baseline for Multi-Document Summarization

Demian Gholipour Ghalandari


Abstract
The centroid-based model for extractive document summarization is a simple and fast baseline that ranks sentences by their similarity to a centroid vector. In this paper, we apply this ranking to candidate summaries instead of individual sentences and use a simple greedy algorithm to find the best summary. Furthermore, we show ways to scale to larger input document collections by selecting a small number of sentences from each document before constructing the summary. Experiments were conducted on the DUC2004 dataset for multi-document summarization. We observe higher performance than the original model, on par with more complex state-of-the-art methods.
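The idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the tokenizer, the TF-IDF weighting (each sentence treated as its own document), the word budget `max_words`, and the rule of stopping when no sentence improves the score are all assumptions made for illustration. The function name `greedy_centroid_summary` is hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens with surrounding punctuation stripped.
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_vectors(sentences):
    # Assumption: each sentence counts as one "document" when computing IDF.
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: c * idf[w] for w, c in Counter(toks).items()} for toks in tokenized]


def add(u, v):
    # Element-wise sum of two sparse vectors stored as dicts.
    out = dict(u)
    for w, x in v.items():
        out[w] = out.get(w, 0.0) + x
    return out


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_centroid_summary(sentences, max_words=100):
    vectors = tfidf_vectors(sentences)
    # The centroid is the sum of all sentence vectors; cosine similarity
    # ignores scale, so summing and averaging give the same ranking.
    centroid = {}
    for v in vectors:
        centroid = add(centroid, v)

    selected, summary_vec, score, length = [], {}, 0.0, 0
    while True:
        best = None
        for i, v in enumerate(vectors):
            n_words = len(sentences[i].split())
            if i in selected or length + n_words > max_words:
                continue
            # Score the whole candidate summary, not the sentence alone.
            cand = cosine(add(summary_vec, v), centroid)
            if cand > score and (best is None or cand > best[0]):
                best = (cand, i, n_words)
        if best is None:
            # Assumed stopping rule: no remaining sentence fits or improves the score.
            break
        score, idx, n_words = best
        selected.append(idx)
        summary_vec = add(summary_vec, vectors[idx])
        length += n_words
    return [sentences[i] for i in selected]
```

A call such as `greedy_centroid_summary(sentences, max_words=100)` returns the chosen sentences in selection order. Because each step scores the summary as a whole, a sentence that repeats already-selected content tends to add little similarity to the centroid, which is the intuition behind ranking summaries rather than sentences.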
Anthology ID:
W17-4511
Volume:
Proceedings of the Workshop on New Frontiers in Summarization
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
85–90
URL:
https://www.aclweb.org/anthology/W17-4511
DOI:
10.18653/v1/W17-4511
PDF:
http://aclanthology.lst.uni-saarland.de/W17-4511.pdf