Antoine Doucet


pdf bib
Alleviating Digitization Errors in Named Entity Recognition for Historical Documents
Emanuela Boros | Ahmed Hamdi | Elvys Linhares Pontes | Luis Adrián Cabrera-Diego | Jose G. Moreno | Nicolas Sidere | Antoine Doucet
Proceedings of the 24th Conference on Computational Natural Language Learning

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.

pdf bib
Dataset for Temporal Analysis of English-French Cognates
Esteban Frossard | Mickael Coustaty | Antoine Doucet | Adam Jatowt | Simon Hengchen
Proceedings of the 12th Language Resources and Evaluation Conference

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support investigating the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches of synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.

pdf bib
A Dataset for Multi-lingual Epidemiological Event Extraction
Stephen Mutuvi | Antoine Doucet | Gaël Lejeune | Moses Odeo
Proceedings of the 12th Language Resources and Evaluation Conference

This paper proposes a corpus for the development and evaluation of tools and techniques for identifying emerging infectious disease threats in online news text. The corpus can not only be used for information extraction, but also for other natural language processing (NLP) tasks such as text classification. We make use of articles published on the Program for Monitoring Emerging Diseases (ProMED) platform, which provides current information about outbreaks of infectious diseases globally. Among the key pieces of information present in the articles is the uniform resource locator (URL) to the online news sources where the outbreaks were originally reported. We detail the procedure followed to build the dataset, which includes leveraging the source URLs to retrieve the news reports and subsequently pre-processing the retrieved documents. We also report on experimental results of event extraction on the dataset using the Data Analysis for Information Extraction in any Language(DAnIEL) system. DAnIEL is a multilingual news surveillance system that leverages unique attributes associated with news reporting to extract events: repetition and saliency. The system has wide geographical and language coverage, including low-resource languages. In addition, we compare different classification approaches in terms of their ability to differentiate between epidemic-related and unrelated news articles that constitute the corpus.

pdf bib
Multilingual Epidemiological Text Classification: A Comparative Study
Stephen Mutuvi | Emanuela Boros | Antoine Doucet | Adam Jatowt | Gaël Lejeune | Moses Odeo
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we approach the multilingual text classification task in the context of the epidemiological field. Multilingual text classification models tend to perform differently across different languages (low- or high-resourced), more particularly when the dataset is highly imbalanced, which is the case for epidemiological datasets. We conduct a comparative study of different machine and deep learning text classification models using a dataset comprising news articles related to epidemic outbreaks from six languages, four low-resourced and two high-resourced, in order to analyze the influence of the nature of the language, the structure of the document, and the size of the data. Our findings indicate that the performance of the models based on fine-tuned language models exceeds by more than 50% the chosen baseline models that include a specialized epidemiological news surveillance system and several machine learning models. Also, low-resource languages are highly influenced not only by the typology of the languages on which the models have been pre-trained or/and fine-tuned but also by their size. Furthermore, we discover that the beginning and the end of documents provide the most salient features for this task and, as expected, the performance of the models was proportionate to the training data size.


pdf bib
TLR at BSNLP2019: A Multilingual Named Entity Recognition System
Jose G. Moreno | Elvys Linhares Pontes | Mickael Coustaty | Antoine Doucet
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

This paper presents our participation at the shared task on multilingual named entity recognition at BSNLP2019. Our strategy is based on a standard neural architecture for sequence labeling. In particular, we use a mixed model which combines multilingualcontextual and language-specific embeddings. Our only submitted run is based on a voting schema using multiple models, one for each of the four languages of the task (Bulgarian, Czech, Polish, and Russian) and another for English. Results for named entity recognition are encouraging for all languages, varying from 60% to 83% in terms of Strict and Relaxed metrics, respectively.


pdf bib
The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Agata Savary | Carlos Ramisch | Silvio Cordeiro | Federico Sangati | Veronika Vincze | Behrang QasemiZadeh | Marie Candito | Fabienne Cap | Voula Giouli | Ivelina Stoyanova | Antoine Doucet
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

Multiword expressions (MWEs) are known as a “pain in the neck” for NLP due to their idiosyncratic behaviour. While some categories of MWEs have been addressed by many studies, verbal MWEs (VMWEs), such as to take a decision, to break one’s heart or to turn off, have been rarely modelled. This is notably due to their syntactic variability, which hinders treating them as “words with spaces”. We describe an initiative meant to bring about substantial progress in understanding, modelling and processing VMWEs. It is a joint effort, carried out within a European research network, to elaborate universal terminologies and annotation guidelines for 18 languages. Its main outcome is a multilingual 5-million-word annotated corpus which underlies a shared task on automatic identification of VMWEs. This paper presents the corpus annotation methodology and outcome, the shared task organisation and the results of the participating systems.

pdf bib
Neural Networks for Multi-Word Expression Detection
Natalia Klyueva | Antoine Doucet | Milan Straka
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

In this paper we describe the MUMULS system that participated to the 2017 shared task on automatic identification of verbal multiword expressions (VMWEs). The MUMULS system was implemented using a supervised approach based on recurrent neural networks using the open source library TensorFlow. The model was trained on a data set containing annotated VMWEs as well as morphological and syntactic information. The MUMULS system performed the identification of VMWEs in 15 languages, it was one of few systems that could categorize VMWEs type in nearly all languages.


pdf bib
“Let Everything Turn Well in Your Wife”: Generation of Adult Humor Using Lexical Constraints
Alessandro Valitutti | Hannu Toivonen | Antoine Doucet | Jukka M. Toivanen
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
DAnIEL, parsimonious yet high-coverage multilingual epidemic surveillance (DAnIEL : Veille épidémiologique multilingue parcimonieuse) [in French]
Gaël Lejeune | Romain Brixtel | Charlotte Lecluze | Antoine Doucet | Nadine Lucas
Proceedings of TALN 2013 (Volume 3: System Demonstrations)


pdf bib
Filtering news for epidemic surveillance: towards processing more languages with fewer resources
Gaël Lejeune | Antoine Doucet | Roman Yangarber | Nadine Lucas
Proceedings of the 4th Workshop on Cross Lingual Information Access


pdf bib
Non-Contiguous Word Sequences for Information Retrieval
Antoine Doucet | Helana Ahonen-Myka
Proceedings of the Workshop on Multiword Expressions: Integrating Processing