Automatic Term Extraction from Newspaper Corpora: Making the Most of Specificity and Common Features

Patrick Drouin, Jean-Benoît Morel, Marie-Claude L’Homme


Abstract
The first step of any terminological work is to set up a reliable, specialized corpus composed of documents written by specialists and then to apply automatic term extraction (ATE) methods to this corpus in order to retrieve a first list of potential terms. The experiment described in this paper differs quite drastically from this usual process, since we apply ATE to unspecialized corpora. The corpus used for this study was built from newspaper articles retrieved from the Web using a short list of keywords. The general intuition on which this research is based is that ATE based on corpus comparison techniques can be used to capture both similarities and dissimilarities between corpora. The latter are exploited through a termhood measure and the former through word embeddings. Our initial results were validated manually and show that combining a traditional ATE method that focuses on dissimilarities between corpora with newer methods that exploit similarities (more specifically, distributional features of candidates) leads to promising results.
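As a rough illustration of the two signals the abstract mentions, the sketch below scores words by how much more frequent they are in a target corpus than in a reference corpus, and then looks up the nearest neighbour of a candidate in an embedding space. It is not the authors' implementation: the signed log-likelihood ratio is a stand-in for whatever termhood measure the paper uses, and the toy corpora, the two-dimensional vectors, and the function names `log_likelihood` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, n_target, n_reference):
    """Signed log-likelihood ratio for a word seen `a` times in the target
    corpus and `b` times in the reference corpus (a stand-in termhood score).
    Positive values mean the word is over-represented in the target."""
    total = n_target + n_reference
    expected_target = n_target * (a + b) / total
    expected_reference = n_reference * (a + b) / total
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_target)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    ll *= 2
    return ll if a / n_target >= b / n_reference else -ll


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy corpora (hypothetical): a topical target and a general reference.
target = Counter("the glacier melt raises sea level as the glacier retreats".split())
reference = Counter("the match ended as the team left the field and the crowd".split())
n_t, n_r = sum(target.values()), sum(reference.values())

# Dissimilarity signal: rank target words by specificity.
scores = {w: log_likelihood(c, reference.get(w, 0), n_t, n_r) for w, c in target.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Similarity signal: toy 2-d embeddings (hypothetical values), used to find
# the word whose distribution is closest to a high-scoring candidate.
embeddings = {"glacier": [0.9, 0.1], "iceberg": [0.8, 0.2], "match": [0.1, 0.9]}
seed = ranked[0]
nearest = max(
    (w for w in embeddings if w != seed),
    key=lambda w: cosine(embeddings[seed], embeddings[w]),
)
```

In this toy setting the specificity ranking surfaces topical words and pushes function words such as "the" below zero, while the embedding lookup proposes a related word that never appeared in the target corpus.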
Anthology ID:
2020.computerm-1.1
Volume:
Proceedings of the 6th International Workshop on Computational Terminology
Month:
May
Year:
2020
Address:
Marseille, France
Venues:
CompuTerm | LREC | WS
Publisher:
European Language Resources Association
Pages:
1–7
Language:
English
URL:
https://www.aclweb.org/anthology/2020.computerm-1.1
PDF:
http://aclanthology.lst.uni-saarland.de/2020.computerm-1.1.pdf