Emanuela Boroş

Also published as: Emanuela Boroș, Emanuela Boros


pdf bib
Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
Emanuela Boros | Ahmed Hamdi | Elvys Linhares Pontes | Luis Adrián Cabrera-Diego | Jose G. Moreno | Nicolas Sidere | Antoine Doucet
Proceedings of the 24th Conference on Computational Natural Language Learning

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.

pdf bib
Multilingual Epidemiological Text Classification: A Comparative Study
Stephen Mutuvi | Emanuela Boros | Antoine Doucet | Adam Jatowt | Gaël Lejeune | Moses Odeo
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across different languages (low- or high-resourced), more particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine and deep learning text classification models using a dataset comprising news articles related to epidemic outbreaks from six languages, four low-resourced and two high-resourced, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds by more than 50% the chosen baseline models that include a specialized epidemiological news surveillance system and several machine learning models. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained or/and fine-tuned but also by their size. Furthermore, we discover that the beginning and the end of documents provide the most salient features for this task and, as expected, the performance of the models was proportionate to the training data size.


pdf bib
Event Role Extraction using Domain-Relevant Word Representations
Emanuela Boroş | Romaric Besançon | Olivier Ferret | Brigitte Grau
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Event Role Labelling using a Neural Network Model (Étiquetage en rôles événementiels fondé sur l’utilisation d’un modèle neuronal) [in French]
Emanuela Boroş | Romaric Besançon | Olivier Ferret | Brigitte Grau
Proceedings of TALN 2014 (Volume 1: Long Papers)


pdf bib
Sentimatrix – Multilingual Sentiment Analysis Service
Alexandru-Lucian Gînscă | Emanuela Boroș | Adrian Iftene | Diana Trandabăț | Mihai Toader | Marius Corîci | Cenel-Augusto Perez | Dan Cristea
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)