Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
Emanuela Boros | Ahmed Hamdi | Elvys Linhares Pontes | Luis Adrián Cabrera-Diego | Jose G. Moreno | Nicolas Sidere | Antoine Doucet
Proceedings of the 24th Conference on Computational Natural Language Learning

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.


Classification and Optimization Algorithms: the LIA/ADOC participation at DEFT’14 (Algorithmes de classification et d’optimisation : participation du LIA/ADOC à DEFT’14) [in French]
Luis Adrián Cabrera-Diego | Stéphane Huet | Bassam Jabaian | Alejandro Molina | Juan-Manuel Torres-Moreno | Marc El-Bèze | Barthélémy Durette
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)


SegCV : Efficient parsing of résumés with analysis and correction of errors (SegCV : traitement efficace de CV avec analyse et correction d’erreurs) [in French]
Luis Adrián Cabrera-Diego | Juan-Manuel Torres-Moreno | Marc El-Bèze
Proceedings of TALN 2013 (Volume 2: Short Papers)


Using Wikipedia to Validate the Terminology found in a Corpus of Basic Textbooks
Jorge Vivaldi | Luis Adrián Cabrera-Diego | Gerardo Sierra | María Pozzi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

A scientific vocabulary is a set of terms that designate scientific concepts. This set of lexical units can be used in several applications, ranging from the development of terminological dictionaries and machine translation systems to the development of lexical databases and beyond. Even though automatic term recognition systems have existed since the 1980s, this process is still mainly done by hand, since manual work generally yields more accurate results, albeit more slowly and at a higher cost. Some of the reasons for this are the fairly low precision and recall obtained by automatic systems, the domain dependence of existing tools, and the lack of available semantic knowledge needed to validate their results. In this paper we present a method that uses Wikipedia as a semantic knowledge resource to validate term candidates from a set of scientific textbooks used in the last three years of high school for mathematics, health education and ecology. The proposed method may be applied to any domain or language (assuming there is a minimal coverage by Wikipedia).


The RST Spanish Treebank On-line Interface
Iria da Cunha | Juan-Manuel Torres-Moreno | Gerardo Sierra | Luis-Adrián Cabrera-Diego | Brenda-Gabriela Castro-Rolón | Juan-Miguel Rolland Bartilotti
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011