Multi-word Entity Classification in a Highly Multilingual Environment

Sophie Chesney, Guillaume Jacquet, Ralf Steinberger, Jakub Piskorski


Abstract
This paper describes an approach for classifying millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. To classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained and tested distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was a multi-class SVM using a TF.IDF-weighted data representation. Interestingly, a single classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an averaged F1-value of 88.8%. In this paper, we present the training and test data, including a human evaluation of their accuracy, describe the methods used to train the classifiers, and discuss the results.
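To make the idea of a TF.IDF-weighted, token-only representation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes whitespace tokenisation, lowercasing, and a smoothed IDF formula, none of which are specified in the abstract; the function name `tfidf_vectors` and the three entity names are hypothetical illustrations. The resulting sparse vectors are the kind of input a multi-class SVM could be trained on.

```python
import math
from collections import Counter


def tfidf_vectors(entities):
    """Return one {token: weight} dict per entity name.

    Assumptions (not taken from the paper): whitespace tokenisation,
    lowercasing, length-normalised term frequency, and a smoothed IDF
    of the form ln((1 + N) / (1 + df)) + 1.
    """
    docs = [name.lower().split() for name in entities]
    n_docs = len(docs)
    # Document frequency: number of entity names that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {
            tok: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        }
        vectors.append(vec)
    return vectors


# Hypothetical toy entity names, for illustration only.
examples = [
    "World Health Organization",
    "World Trade Organization",
    "Battle of Waterloo",
]
vectors = tfidf_vectors(examples)
```

In this toy set, a token shared by several entity names (such as "world") receives a lower weight than a token that appears in only one name (such as "health"), which is the property that lets a token-based classifier focus on the more distinctive words.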
Anthology ID:
W17-1702
Volume:
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Month:
April
Year:
2017
Address:
Valencia, Spain
Venues:
MWE | WS
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
11–20
URL:
https://www.aclweb.org/anthology/W17-1702
DOI:
10.18653/v1/W17-1702
PDF:
http://aclanthology.lst.uni-saarland.de/W17-1702.pdf