TF-IDF Character N-grams versus Word Embedding-based Models for Fine-grained Event Classification: A Preliminary Study

Jakub Piskorski, Guillaume Jacquet


Abstract
Automating the detection of event mentions in online texts and their classification vis-à-vis domain-specific event type taxonomies has been acknowledged by many organisations worldwide to be of paramount importance for facilitating intelligence gathering. This paper reports on preliminary experiments comparing various linguistically lightweight approaches to fine-grained event classification based on short text snippets reporting on events. In particular, we compare the performance of an SVM-based model using TF-IDF-weighted character n-grams against SVMs trained on features from various off-the-shelf pre-trained word embeddings (GloVe, BERT, FastText). We exploit a relatively large event corpus consisting of circa 610K short text event descriptions classified into 25 event categories that cover political violence and protest events. The best results, i.e., 83.5% macro and 92.4% micro F1 score, were obtained using the TF-IDF-weighted character n-gram model.
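The two ingredients the abstract names, TF-IDF-weighted character n-grams and macro versus micro F1, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the n-gram range (2 to 4), the smoothed IDF formula, the L2 normalisation, and the function names `char_ngrams`, `tfidf_vectors` and `f1_scores` are all assumptions made for illustration, and the SVM classifier itself is omitted.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams of length n_min..n_max (lowercased).

    The 2..4 range is an assumption, not a setting taken from the paper.
    """
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_min=2, n_max=4):
    """Map each document to a sparse {ngram: weight} dict, L2-normalised."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t, f_p, f_n):
        denom = 2 * t + f_p + f_n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so rare classes count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    # Hypothetical labels, not categories from the paper's taxonomy.
    gold = ["protest", "protest", "protest", "riot"]
    pred = ["protest", "protest", "protest", "protest"]
    print(f1_scores(gold, pred))
```

In this toy example the rare class is never predicted, so macro F1 falls well below micro F1. The same mechanism can explain a gap such as the one between the reported 83.5% macro and 92.4% micro scores when some event categories are much less frequent than others.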
Anthology ID:
2020.aespen-1.6
Volume:
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
AESPEN | LREC | WS
Publisher:
European Language Resources Association (ELRA)
Pages:
26–34
Language:
English
URL:
https://www.aclweb.org/anthology/2020.aespen-1.6
PDF:
http://aclanthology.lst.uni-saarland.de/2020.aespen-1.6.pdf