Mathieu Roche


2020

Information retrieval for animal disease surveillance: a pattern-based approach.
Sarah Valentin | Mathieu Roche | Renaud Lancelot
Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis

Animal disease-related news articles are rich in information useful for risk assessment. In this paper, we explore a method to automatically retrieve sentence-level epidemiological information. Our method is an incremental approach to create and expand patterns at both lexical and syntactic levels. Expert knowledge input is used at different steps of the approach. Distributed vector representations (word embeddings) were used to expand the patterns at the lexical level, thus alleviating manual curation. We showed that expert validation was crucial to improve the precision of automatically generated patterns.
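
The lexical expansion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `expand_terms`), the toy three-dimensional vectors and the 0.9 threshold are all assumptions made for illustration. The idea shown is only that terms whose vectors lie close to a seed term become expansion candidates, which would then be passed to an expert for validation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_terms(seed, vectors, threshold=0.9):
    """Return (term, score) pairs whose vectors are close to the seed term,
    sorted from most to least similar. The threshold is a hypothetical value."""
    seed_vec = vectors[seed]
    scored = [(term, cosine(seed_vec, vec))
              for term, vec in vectors.items() if term != seed]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])

# Toy vectors invented for illustration; real embeddings would be learned
# from a corpus and have far more dimensions.
TOY_VECTORS = {
    "outbreak": [0.9, 0.1, 0.0],
    "epidemic": [0.8, 0.2, 0.1],
    "case": [0.7, 0.3, 0.0],
    "farm": [0.1, 0.9, 0.2],
}

candidates = expand_terms("outbreak", TOY_VECTORS)
candidate_terms = [term for term, _ in candidates]
```

With these toy vectors, "epidemic" and "case" pass the threshold and "farm" does not. In line with the abstract, such candidates are suggestions only; expert validation decides which ones enter the patterns.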

Automated Processing of Multilingual Online News for the Monitoring of Animal Infectious Diseases
Sarah Valentin | Renaud Lancelot | Mathieu Roche
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

The Platform for Automated extraction of animal Disease Information from the web (PADI-web) is an automated system that monitors the web to detect emerging animal infectious diseases. The tool automatically collects news via customised multilingual queries, classifies them and extracts epidemiological information. We detail the processing of multilingual online sources by PADI-web and analyse the translated outputs in a case study.

2018

Automatic Identification of Research Fields in Scientific Papers
Eric Kergosien | Amin Farvardin | Maguelonne Teisseire | Marie-Noëlle Bessagnet | Joachim Schöpfel | Stéphane Chaudiron | Bernard Jacquemin | Annig Lacayrelle | Mathieu Roche | Christian Sallaberry | Jean Philippe Tonneau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Integration of Lexical and Semantic Knowledge for Sentiment Analysis in SMS
Wejdene Khiari | Mathieu Roche | Asma Bouhafs Hafsia
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

With the explosive growth of online social media (forums, blogs, and social networks), exploitation of these new information sources has become essential. Our work is based on the sud4science project. The goal of this project is to perform multidisciplinary work on a corpus of authentic SMS, in French, collected in 2011 and anonymised (88milSMS corpus: http://88milsms.huma-num.fr). This paper highlights a new method to integrate opinion detection knowledge from an SMS corpus by combining lexical and semantic information. More precisely, our approach gives more weight to words with a sentiment (i.e. presence of words in a dedicated dictionary) for a classification task based on three classes: positive, negative, and neutral. The experiments were conducted on two corpora: an elongated SMS corpus (i.e. repetitions of characters in messages) and a non-elongated SMS corpus. We noted that non-elongated SMS were much better classified than elongated SMS. Overall, this study highlighted that the integration of semantic knowledge always improves classification.
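
As a rough illustration of "giving more weight to words with a sentiment", the sketch below builds bag-of-words features in which tokens found in a sentiment dictionary are multiplied by a boost factor. The function name `weighted_bow`, the toy lexicon and the boost value of 2.0 are assumptions for illustration; the abstract does not specify the weighting scheme actually used.

```python
from collections import Counter

def weighted_bow(tokens, sentiment_lexicon, boost=2.0):
    """Count tokens, multiplying the count of any token found in the
    sentiment lexicon by `boost` (a hypothetical weighting factor)."""
    counts = Counter(tokens)
    return {
        token: count * (boost if token in sentiment_lexicon else 1.0)
        for token, count in counts.items()
    }

# Toy lexicon and message, invented for illustration.
TOY_LEXICON = {"super", "nul"}
features = weighted_bow(["film", "super", "super", "ce", "soir"], TOY_LEXICON)
```

Here "super" occurs twice and is in the lexicon, so its feature value is 4.0, while the other tokens keep a value of 1.0. Such a feature dictionary could then feed any standard classifier over the three classes (positive, negative, neutral).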

Automatic Biomedical Term Polysemy Detection
Juan Antonio Lossio-Ventura | Clement Jonquet | Mathieu Roche | Maguelonne Teisseire
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Polysemy is the capacity for a word to have multiple meanings. Polysemy detection is a first step for Word Sense Induction (WSI), which makes it possible to find the different meanings of a term. Polysemy detection is also important for information extraction (IE) systems and for building and enriching terminologies and ontologies. In this paper, we present a novel approach to detect whether a biomedical term is polysemic, with the long-term goal of enriching biomedical ontologies. This approach is based on the extraction of new features. In this context, we propose to extract features in two ways: (i) directly from the text dataset, and (ii) from an induced graph. Our method obtains an Accuracy and F-Measure of 0.978.

Monitoring Disease Outbreak Events on the Web Using Text-mining Approach and Domain Expert Knowledge
Elena Arsevska | Mathieu Roche | Sylvain Falala | Renaud Lancelot | David Chavernac | Pascal Hendrikx | Barbara Dufour
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Timeliness and precision in detecting infectious animal disease outbreaks from information published on the web are crucial for preventing their spread. We propose a generic method to enrich and extend the use of different expressions as queries in order to improve the acquisition of relevant disease-related pages on the web. Our method combines a text mining approach to extract terms from corpora of relevant disease outbreak documents, and domain expert elicitation (Delphi method) to propose expressions and to select relevant combinations between terms obtained with text mining. In this paper, we evaluated the performance as queries of a number of expressions obtained with text mining and validated by a domain expert, and of expressions proposed by a panel of 21 domain experts. We used African swine fever as an infectious animal disease model. As queries, the expressions obtained with text mining outperformed the expressions proposed by domain experts. However, domain experts proposed expressions not extracted automatically. Our method is simple to conduct and flexible enough to adapt to any other animal infectious disease, and even to the public health domain.

Découverte de nouvelles entités et relations spatiales à partir d’un corpus de SMS (Discovering New Spatial Entities and Relations from an SMS Corpus)
Sarah Zenasni | Maguelonne Teisseire | Mathieu Roche | Eric Kergosien
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Given the masses of data available today, much work on the analysis of spatial information relies on textual data. Mediated communication (SMS, tweets, etc.) that conveys spatial information plays a predominant role. The aim of the work presented in this article is to extract this spatial information from an authentic corpus of SMS messages in French. We propose a process in which we first extract new spatial entities (for example, motpellier and montpeul, to be associated with the toponym Montpellier). We then identify new spatial relations that precede the spatial entities (for example, sur, par, pres, etc.). The task is difficult and complex because SMS language relies on loosely standardised writing (emergence of many new lexical forms, heavy use of abbreviations, variation with respect to standard written language, etc.). Experiments carried out on the 88milSMS corpus highlight the robustness of our system in identifying new spatial entities and relations.
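
Associating a variant such as motpellier with the toponym Montpellier can be sketched with a plain edit distance. This is a stand-in technique chosen for illustration, not the method of the paper; the function names (`edit_distance`, `closest_toponym`), the two-entry gazetteer and the `max_distance` value are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming,
    keeping only the previous row of the table)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def closest_toponym(token, gazetteer, max_distance=2):
    """Return the gazetteer entry nearest to `token`, or None if even the
    nearest entry is farther than `max_distance` (a hypothetical cut-off)."""
    best = min(gazetteer, key=lambda name: edit_distance(token.lower(), name.lower()))
    if edit_distance(token.lower(), best.lower()) <= max_distance:
        return best
    return None

# Toy gazetteer invented for illustration.
TOY_GAZETTEER = ["Montpellier", "Marseille"]
match = closest_toponym("motpellier", TOY_GAZETTEER)
```

The variant motpellier is one edit away from Montpellier and is matched. The variant montpeul is four edits away, so a strict edit-distance cut-off would miss it, which hints at why SMS spelling calls for more than a simple string distance.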

2015

Identification des unités de mesure dans les textes scientifiques (Identifying Units of Measurement in Scientific Texts)
Soumia Lilia Berrahou | Patrice Buche | Juliette Dibie-Barthélemy | Mathieu Roche
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

The work presented in this article concerns the identification of specialised terms (units of measurement) from textual data in order to enrich a Termino-Ontological Resource (RTO, from the French Ressource Termino-Ontologique). The first step of our method predicts the location of unit-of-measurement variants in documents, using a supervised learning method. This method substantially reduces the search space for variants while staying within an optimal search context (an 86% reduction of the search space on the corpus studied). Once the search space has been reduced to unit variants, the second step uses a new similarity measure to automatically identify the discovered variants with respect to a unit term already referenced in the RTO, with a precision of 82% for a threshold above 0.6 on the corpus studied.
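
The second step, matching a discovered variant to a unit already referenced in the resource when a similarity score exceeds 0.6, can be sketched as follows. The similarity used here is `difflib.SequenceMatcher.ratio()` from the Python standard library, swapped in purely as a stand-in: the abstract's own similarity measure is not described, and the function name `match_unit_variant` and the toy unit list are assumptions.

```python
from difflib import SequenceMatcher

def match_unit_variant(variant, known_units, threshold=0.6):
    """Return (unit, score) for the known unit most similar to `variant`,
    or None when the best score does not exceed the threshold.
    SequenceMatcher.ratio() is a stand-in similarity, not the paper's measure."""
    scored = [(unit, SequenceMatcher(None, variant, unit).ratio())
              for unit in known_units]
    best_unit, best_score = max(scored, key=lambda pair: pair[1])
    return (best_unit, best_score) if best_score > threshold else None

# Toy list of referenced units, invented for illustration.
TOY_UNITS = ["kg/m^2", "mol/L"]
result = match_unit_variant("kg/m2", TOY_UNITS)
no_match = match_unit_variant("xyz", TOY_UNITS)
```

The variant "kg/m2" is linked to "kg/m^2", while "xyz" shares no characters with either unit and is rejected. Only the 0.6 threshold comes from the abstract; everything else in the sketch is illustrative.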

2014

Automatic Term Extraction Combining Different Information (Extraction automatique de termes combinant différentes informations) [in French]
Juan Antonio Lossio-Ventura | Clement Jonquet | Mathieu Roche | Maguelonne Teisseire
Proceedings of TALN 2014 (Volume 2: Short Papers)

Towards Electronic SMS Dictionary Construction: An Alignment-based Approach
Cédric Lopez | Reda Bestandji | Mathieu Roche | Rachel Panckhurst
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we propose a method for aligning text messages (entitled AlignSMS) in order to automatically build an SMS dictionary. An extract of 100 text messages from the 88milSMS corpus (Panckhurst et al., 2013, 2014) was used as an initial test. More than 90,000 authentic text messages in French were collected from the general public by a group of academics in the south of France in the context of the sud4science project (http://www.sud4science.org). This project is itself part of a vast international SMS data collection project, entitled sms4science (http://www.sms4science.org, Fairon et al. 2006, Cougnon, 2014). After corpus collation, pre-processing and anonymisation (Accorsi et al., 2012, Patel et al., 2013), we discuss how “raw” anonymised text messages can be transcoded into normalised text messages, using a statistical alignment method. The future objective is to set up a hybrid (symbolic/statistical) approach based on both grammar rules and our statistical AlignSMS method.

2012

NOMIT: Automatic Titling by Nominalizing
Cédric Lopez | Violaine Prince | Mathieu Roche
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Just Title It! (by an Online Application)
Cédric Lopez | Violaine Prince | Mathieu Roche
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Automatic titling of Articles Using Position and Statistical Information
Cédric Lopez | Violaine Prince | Mathieu Roche
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

How to Expand Dictionaries by Web-Mining Techniques
Nicolas Béchet | Mathieu Roche
Proceedings of the 2nd Workshop on Cognitive Aspects of the Lexicon