A Graph-based Text Similarity Measure That Employs Named Entity Information

Leonidas Tsekouras, Iraklis Varlamis, George Giannakopoulos


Abstract
Text comparison is an interesting though hard task, with many applications in Natural Language Processing. This work introduces a new text-similarity measure that combines named-entity information extracted from the texts with the n-gram graph model for representing documents. With OpenCalais as the named-entity recognition service and the JINSECT toolkit for constructing and managing n-gram graphs, the measure is embedded in a text clustering algorithm (k-Means). Evaluating the produced clusters with several clustering validity metrics shows that extracting named entities as a first step can improve the time performance of similarity measures based on the n-gram graph representation without affecting the overall performance of the NLP task.
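To make the n-gram graph idea concrete, the following is a minimal Python sketch of a character n-gram graph and a value-style similarity between two such graphs. It is an illustration under stated assumptions, not the authors' implementation and not the JINSECT API: the function names `ngram_graph` and `value_similarity`, the parameters `n` and `window`, and the exact weighting are all hypothetical choices. The named-entity step is not shown, since the abstract does not state how entity information enters the measure and OpenCalais is an external service.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build an undirected character n-gram graph (illustrative only).

    Nodes are character n-grams; an edge links two n-grams whose start
    positions are at most `window` apart. The edge weight counts how
    often the pair co-occurs. `n` and `window` are assumed defaults.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Pair the current n-gram with the next `window` n-grams.
        for j in range(i + 1, min(i + window + 1, len(grams))):
            edge = tuple(sorted((gram, grams[j])))  # undirected edge key
            edges[edge] += 1
    return edges


def value_similarity(g1, g2):
    """Score two graphs in [0, 1] from shared edges and their weights.

    Each shared edge contributes min(w1, w2) / max(w1, w2); the sum is
    normalised by the size of the larger graph. This is one common
    formulation, assumed here for illustration.
    """
    if not g1 or not g2:
        return 0.0
    shared = set(g1) & set(g2)
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in shared)
    return total / max(len(g1), len(g2))


if __name__ == "__main__":
    a = ngram_graph("named entities in text")
    b = ngram_graph("named entity in texts")
    print(value_similarity(a, b))
```

A pairwise score of this kind could serve as the similarity inside a clustering loop such as k-Means; restricting the input to extracted entity strings rather than full documents would shrink the graphs, which is one plausible reading of the time-performance claim in the abstract.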
Anthology ID: R17-1098
Volume: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
Month: September
Year: 2017
Address: Varna, Bulgaria
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 765–771
URL: https://doi.org/10.26615/978-954-452-049-6_098
DOI: 10.26615/978-954-452-049-6_098
PDF: https://doi.org/10.26615/978-954-452-049-6_098