Olivier Ferret


2020

Which Dependency Parser to Use for Distributional Semantics in a Specialized Domain?
Pauline Brunet | Olivier Ferret | Ludovic Tanguy
Proceedings of the 6th International Workshop on Computational Terminology

We present a study whose objective is to compare several dependency parsers for English applied to a specialized corpus for building distributional count-based models from syntactic dependencies. One of the particularities of this study is to focus on the concepts of the target domain, which mainly occur in documents as multi-terms and must be aligned with the outputs of the parsers. We compare a set of ten parsers in terms of syntactic triplets but also in terms of distributional neighbors extracted from the models built from these triplets, both with and without an external reference concerning the semantic relations between concepts. We show more particularly that some patterns of proximity between these parsers can be observed across our different evaluations, which could give insights for anticipating the performance of a parser for building distributional models from a given corpus.
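The count-based models compared here are built from syntactic triplets. A minimal sketch of such a pipeline (the triple format, raw-frequency weighting, and cosine-based neighbor ranking are simplifying assumptions, not the authors' exact setup):

```python
from collections import defaultdict
import math

def build_model(triples):
    """Build sparse word vectors from (head, relation, dependent) triples.
    Each word is represented by the (relation, co-word) contexts it occurs in;
    the dependent side gets an inverse relation so both words are covered."""
    vectors = defaultdict(lambda: defaultdict(int))
    for head, rel, dep in triples:
        vectors[head][(rel, dep)] += 1
        vectors[dep][(rel + "-1", head)] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def neighbors(word, vectors, k=3):
    """Rank the other words of the model by distributional similarity."""
    sims = [(other, cosine(vectors[word], vectors[other]))
            for other in vectors if other != word]
    return sorted(sims, key=lambda p: -p[1])[:k]
```

With triples extracted by any of the parsers under comparison, `neighbors("eat", vectors)` would return the words sharing the most syntactic contexts with "eat".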

Building a Multimodal Entity Linking Dataset From Tweets
Omar Adjali | Romaric Besançon | Olivier Ferret | Hervé Le Borgne | Brigitte Grau
Proceedings of the 12th Language Resources and Evaluation Conference

The task of entity linking, which aims at associating an entity mention with a unique entity in a knowledge base (KB), is useful for advanced Information Extraction tasks such as relation extraction or event detection. Most of the studies that address this problem rely only on textual documents, while an increasing number of sources are multimedia, in particular in the context of social media, where messages are often illustrated with images. In this article, we address the Multimodal Entity Linking (MEL) task and, more particularly, the problem of its evaluation. To this end, we propose a novel method to quasi-automatically build annotated datasets to evaluate methods on the MEL task. The method collects text and images to jointly build a corpus of tweets with ambiguous mentions along with a Twitter KB defining the entities. We release a new annotated dataset of Twitter posts associated with images. We study the key characteristics of the proposed dataset and evaluate the performance of several MEL approaches on it.

Extrinsic Evaluation of French Dependency Parsers on a Specialized Corpus: Comparison of Distributional Thesauri
Ludovic Tanguy | Pauline Brunet | Olivier Ferret
Proceedings of the 12th Language Resources and Evaluation Conference

We present a study in which we compare 11 different French dependency parsers on a specialized corpus (consisting of research articles on NLP from the proceedings of the TALN conference). Due to the lack of a suitable gold standard, we use each of the parsers’ outputs to generate distributional thesauri using a frequency-based method. We compare these 11 thesauri to assess the impact of choosing one parser over another. We show that, without any reference data, we can still identify relevant subsets among the different parsers. We also show that the similarity we identify between parsers is confirmed on a restricted distributional benchmark.

Représentation dynamique et spécifique du contexte textuel pour l’extraction d’événements (Dynamic and specific textual context representation for event extraction)
Dorian Kodelja | Romaric Besançon | Olivier Ferret
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

In this article, which focuses on the supervised extraction of event mentions in texts, we propose to extend a sentence-level model based on a graph-convolution neural architecture exploiting syntactic dependencies. To do so, we integrate a wider context through the representation of distant sentences selected on the basis of coreference relations between entities. We also show the interest of such an integration through evaluations conducted on the TAC Event 2015 reference corpus.

Modèle neuronal pour la résolution de la coréférence dans les dossiers médicaux électroniques (Neural approach for coreference resolution in electronic health records )
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Coreference resolution is an essential component for automatically building medical timelines from electronic health records. In this work, we present a neural approach to coreference resolution for general and clinical entities in medical texts written in English, evaluating it on the reference benchmark for this task, Track 1C of the 2011 i2b2 challenge.

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Hiroshi Noji | Pierre Zweigenbaum | Jun’ichi Tsujii
Proceedings of the 28th International Conference on Computational Linguistics

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level, and open-vocabulary representations.
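The core idea of the Character-CNN module, producing one vector per word from its characters regardless of any vocabulary, can be sketched as follows (the dimensions, the random initialization, and the omission of CharacterBERT's highway and projection layers are simplifications for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM, N_FILTERS, KERNEL = 8, 16, 3
# one embedding per character; "<" and ">" mark word boundaries
char_emb = {c: rng.standard_normal(CHAR_DIM)
            for c in "abcdefghijklmnopqrstuvwxyz<>"}
filters = rng.standard_normal((N_FILTERS, KERNEL * CHAR_DIM))

def encode_word(word):
    """Embed characters, apply a 1-D convolution over character windows,
    then max-pool over positions: any word, even unseen, gets a vector."""
    chars = "<" + word.lower() + ">"
    mat = np.stack([char_emb.get(c, np.zeros(CHAR_DIM)) for c in chars])
    windows = np.stack([mat[i:i + KERNEL].ravel()
                        for i in range(len(chars) - KERNEL + 1)])
    conv = windows @ filters.T        # (n_windows, N_FILTERS)
    return conv.max(axis=0)           # max-pool: one fixed-size vector
```

Because the encoder consults characters only, out-of-vocabulary and misspelled words (frequent in clinical text) still receive representations, which is what makes the model open-vocabulary.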

2019

Comparaison qualitative et extrinsèque d’analyseurs syntaxiques du français : confrontation de modèles distributionnels sur un corpus spécialisé (Extrinsic evaluation of French dependency parsers on a specialised corpus : comparison of distributional thesauri )
Ludovic Tanguy | Pauline Brunet | Olivier Ferret
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

We present a study comparing 11 different dependency parsers for French on a specialized corpus (made up of the archives of the articles of the TALN conference). In the absence of a gold standard, we use the output of each of these parsers to build distributional thesauri using a frequency-based method. We compare these 11 thesauri to offer a first overview of the impact of choosing one parser over another.

Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Pierre Zweigenbaum
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Using pre-trained word embeddings in conjunction with Deep Learning models has become the “de facto” approach in Natural Language Processing (NLP). While this usually yields satisfactory results, off-the-shelf word embeddings tend to perform poorly on texts from specialized domains such as clinical reports. Moreover, training specialized word representations from scratch is often either impossible or ineffective due to the lack of large enough in-domain data. In this work, we focus on the clinical domain for which we study embedding strategies that rely on general-domain resources only. We show that by combining off-the-shelf contextual embeddings (ELMo) with static word2vec embeddings trained on a small in-domain corpus built from the task data, we manage to reach and sometimes outperform representations learned from a large corpus in the medical domain.
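The combination strategy, pairing an off-the-shelf contextual vector with a static vector trained on a small in-domain corpus, reduces to a per-token concatenation. A minimal sketch (the dimensions and the zero-vector fallback for out-of-vocabulary tokens are assumptions):

```python
import numpy as np

def combine(tokens, contextual, static_lookup, dim_static):
    """For each token, concatenate its contextual embedding (e.g. from ELMo)
    with a static in-domain embedding (e.g. word2vec trained on task data);
    tokens missing from the static vocabulary get a zero vector."""
    rows = []
    for tok, ctx in zip(tokens, contextual):
        stat = static_lookup.get(tok.lower(), np.zeros(dim_static))
        rows.append(np.concatenate([ctx, stat]))
    return np.stack(rows)
```

The resulting vectors feed the downstream entity recognizer; only the static half requires any in-domain data, which is what keeps the strategy usable when large specialized corpora are unavailable.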

2018

Using pseudo-senses for improving the extraction of synonyms from word embeddings
Olivier Ferret
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The methods proposed recently for specializing word embeddings according to a particular perspective generally rely on external knowledge. In this article, we propose Pseudofit, a new method for specializing word embeddings according to semantic similarity without any external knowledge. Pseudofit exploits the notion of pseudo-sense for building several representations for each word and uses these representations for making the initial embeddings more generic. We illustrate the interest of Pseudofit for acquiring synonyms and study several variants of Pseudofit according to this perspective.

Evaluation of a Sequence Tagging Tool for Biomedical Texts
Julien Tourille | Matthieu Doutreligne | Olivier Ferret | Aurélie Névéol | Nicolas Paris | Xavier Tannier
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Many applications in biomedical natural language processing rely on sequence tagging as an initial step to perform more complex analysis. To support text analysis in the biomedical domain, we introduce Yet Another SEquence Tagger (YASET), an open-source multi-purpose sequence tagger that implements state-of-the-art deep learning algorithms for sequence tagging. Herein, we evaluate YASET on part-of-speech tagging and named entity recognition in a variety of text genres including articles from the biomedical literature in English and clinical narratives in French. To further characterize performance, we report distributions over 30 runs and different sizes of training datasets. YASET provides state-of-the-art performance on the CoNLL 2003 NER dataset (F1=0.87), MEDPOST corpus (F1=0.97), MERLoT corpus (F1=0.99) and NCBI disease corpus (F1=0.81). We believe that YASET is a versatile and efficient tool that can be used for sequence tagging in biomedical and clinical texts.

Intégration de contexte global par amorçage pour la détection d’événements (Integrating global context via bootstrapping for event detection)
Dorian Kodelja | Romaric Besançon | Olivier Ferret
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Neural approaches have achieved interesting results in event extraction for several years. However, the approaches developed in this framework are generally limited to a sentence-level context. While some event types are easily identifiable at this level, exploiting clues present in other sentences is sometimes necessary to disambiguate events. In this article, we therefore propose to integrate a representation of a broader context to improve the training of a convolutional network. This representation is obtained by bootstrapping, exploiting the results of a first convolutional model operating at the sentence level. In an evaluation carried out on the data of the TAC 2017 campaign, we show that this global model achieves a significant gain over the local model, both models being themselves competitive with the TAC 2017 results. We also study in detail the performance gain of our new model through several complementary experiments.

Utilisation de Représentations Distribuées de Relations pour la Désambiguïsation d’Entités Nommées (Exploiting Relation Embeddings to Improve Entity Linking )
Nicolas Wagner | Romaric Besançon | Olivier Ferret
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Identifying named entities in a text is a fundamental step for many information extraction tasks. For a complete identification, a disambiguation step between similar entities must be performed. This step often relies solely on the textual description of the entities. However, knowledge bases contain richer information in the form of relations between entities; this information can also be exploited to improve entity disambiguation. In this article, we propose an approach for learning distributed representations of these relations and for using them in the named entity disambiguation task. We show the gain of this method on a standard English evaluation corpus from the entity linking task of the TAC-KBP campaign.

Des pseudo-sens pour améliorer l’extraction de synonymes à partir de plongements lexicaux (Pseudo-senses for improving the extraction of synonyms from word embeddings)
Olivier Ferret
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Beyond the models designed to build word embeddings from corpora, methods for specializing these representations along different directions have been proposed. A significant proportion of them rely on external knowledge. In this article, we propose Pseudofit, a new method for specializing word embeddings that focuses on semantic similarity and operates without external knowledge. Pseudofit exploits the notion of pseudo-sense to obtain several representations for the same word and uses this plurality to make the initial embeddings more generic. We illustrate the interest of Pseudofit for extracting synonyms and explore in this context several variants aiming at improving its results.

2017

Construire des représentations denses à partir de thésaurus distributionnels (Distributional Thesaurus Embedding and its Applications)
Olivier Ferret
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

In this article, we address a new problem, called thesaurus embedding, which consists in turning a distributional thesaurus into a dense word representation. We propose to tackle this problem with a method combining graph embedding and the injection of relations into dense representations. We applied and evaluated this method for a large set of English nouns and showed that the resulting dense representations outperform, according to an intrinsic evaluation, dense representations built with state-of-the-art methods on the same corpus. We also illustrate the interest of the proposed method for improving existing dense representations, both in an endogenous and an exogenous way.

Neural Architecture for Temporal Relation Extraction: A Bi-LSTM Approach for Detecting Narrative Containers
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present a neural architecture for containment relation identification between medical events and/or temporal expressions. We experiment on a corpus of de-identified clinical notes in English from the Mayo Clinic, namely the THYME corpus. Our model achieves an F-measure of 0.613 and outperforms the best result reported on this corpus to date.

Temporal information extraction from clinical text
Julien Tourille | Olivier Ferret | Xavier Tannier | Aurélie Névéol
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we present a method for temporal relation extraction from clinical narratives in French and in English. We experiment on two comparable corpora, the MERLOT corpus and the THYME corpus, and show that a common approach can be used for both languages.

Turning Distributional Thesauri into Word Vectors for Synonym Extraction and Expansion
Olivier Ferret
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this article, we propose to investigate a new problem consisting in turning a distributional thesaurus into dense word vectors. More precisely, we propose a method for performing this task by associating graph embedding and distributed representation adaptation. We applied and evaluated it at a large scale for English nouns, focusing on its ability to retrieve synonyms. In this context, we also illustrated the interest of the developed method for three different tasks: the improvement of already existing word embeddings, the fusion of heterogeneous representations and the expansion of synsets.
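To illustrate the thesaurus-to-vectors idea: a distributional thesaurus is a weighted graph of words and neighbors, and dense vectors can be derived from it by factorizing that graph. The sketch below uses a plain truncated SVD as a simplified stand-in for the graph embedding and representation adaptation the paper actually combines:

```python
import numpy as np

def embed_thesaurus(thesaurus, dim=2):
    """Turn a distributional thesaurus (word -> {neighbor: similarity})
    into dense vectors by factorizing its symmetrized adjacency matrix."""
    words = sorted(thesaurus)
    idx = {w: i for i, w in enumerate(words)}
    A = np.zeros((len(words), len(words)))
    for w, neigh in thesaurus.items():
        for n, s in neigh.items():
            if n in idx:
                A[idx[w], idx[n]] = s
    A = (A + A.T) / 2                 # symmetrize the neighbor graph
    U, S, _ = np.linalg.svd(A)
    # scale the left singular vectors by sqrt of the singular values
    return {w: U[idx[w], :dim] * np.sqrt(S[:dim]) for w in words}
```

Words that share neighbors in the thesaurus end up close in the dense space, which is the property exploited for synonym extraction and expansion.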

Taking into account Inter-sentence Similarity for Update Summarization
Maâli Mnasri | Gaël de Chalendar | Olivier Ferret
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Following Gillick and Favre (2009), a lot of work about extractive summarization has modeled this task by associating two contrary constraints: one aims at maximizing the coverage of the summary with respect to its information content while the other represents its size limit. In this context, the notion of redundancy is only implicitly taken into account. In this article, we extend the framework defined by Gillick and Favre (2009) by examining how and to what extent integrating semantic sentence similarity into an update summarization system can improve its results. We show more precisely the impact of this strategy through evaluations performed on DUC 2007 and TAC 2008 and 2009 datasets.
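The interplay between coverage and redundancy can be illustrated with a greedy stand-in for the ILP of Gillick and Favre (2009); the MMR-style penalty, the data format, and the weight `lam` are assumptions for the sketch, not the system's exact formulation:

```python
def summarize(sentences, concept_weight, sim, budget, lam=0.5):
    """Greedy extractive summarization.
    sentences: list of (text, concept_set, length) tuples;
    concept_weight: weight of each concept; sim(i, j): sentence similarity.
    Maximizes marginal concept coverage, penalized by similarity to the
    sentences already selected, under a total length budget."""
    selected, covered, used = [], set(), 0
    while True:
        best, best_gain = None, 0.0
        for i, (_text, concepts, length) in enumerate(sentences):
            if i in selected or used + length > budget:
                continue
            coverage = sum(concept_weight[c] for c in concepts - covered)
            redundancy = max((sim(i, j) for j in selected), default=0.0)
            gain = coverage - lam * redundancy
            if gain > best_gain:          # only positive gains are worth adding
                best, best_gain = i, gain
        if best is None:
            break
        selected.append(best)
        covered |= sentences[best][1]
        used += sentences[best][2]
    return selected
```

Setting `lam=0` recovers the purely coverage-driven behavior, where redundancy is handled only implicitly through the concept sets.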

LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives
Julien Tourille | Olivier Ferret | Xavier Tannier | Aurélie Névéol
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we present our participation to SemEval 2017 Task 12. We used a neural network based approach for entity and temporal relation extraction, and experimented with two domain adaptation strategies. We achieved competitive performance for both tasks.

Unsupervised Event Clustering and Aggregation from Newswire and Web Articles
Swen Ribeiro | Olivier Ferret | Xavier Tannier
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

In this paper, we present an unsupervised pipeline approach for clustering news articles based on identified event instances in their content. We leverage press agency newswire and monolingual word alignment techniques to build meaningful and linguistically varied clusters of articles from the web in the perspective of a broader event type detection task. We validate our approach on a manually annotated corpus of Web articles.

2016

A Dataset for Open Event Extraction in English
Kiem-Hieu Nguyen | Xavier Tannier | Olivier Ferret | Romaric Besançon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts with no supervision, and of grouping together entities corresponding to the same role in a template. Most of the previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English which remedies some of these shortcomings. We make use of Wikinews to select the data inside the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate some existing systems on this new data.

Utilisation des relations d’une base de connaissances pour la désambiguïsation d’entités nommées (Using the Relations of a Knowledge Base to Improve Entity Linking )
Romaric Besançon | Hani Daher | Olivier Ferret | Hervé Le Borgne
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Identifying named entities in a text is an essential task of information extraction tools in many applications. This identification involves recognizing an entity mention in the text, which has been widely studied, and linking the recognized entities to known entities in a knowledge base. This linking often relies on a similarity measure between the textual context of the entity mention and a textual context describing the entities in the knowledge base. However, such a descriptive context is generally not available for all entities. We propose to exploit the relations of the knowledge base to add a disambiguation clue for these entities. We evaluate our work on standard English evaluation corpora from the entity linking task of the TAC-KBP campaign.

Extraction de relations temporelles dans des dossiers électroniques patient (Extracting Temporal Relations from Electronic Health Records)
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

The temporal analysis of clinical documents yields rich representations of the information contained in electronic health records. This analysis relies on extracting events, temporal expressions, and the relations between them. In this work, we assume that the relevant events and temporal expressions are available, and we focus on the temporal relations between two events or between an event and a temporal expression. We present supervised classification models for extracting these relations in French and in English. The performance obtained is comparable in both languages, suggesting that different clinical domains and different languages could be addressed in a similar way.

Intégration de la similarité entre phrases comme critère pour le résumé multi-document (Integrating sentence similarity as a constraint for multi-document summarization)
Maâli Mnasri | Gaël de Chalendar | Olivier Ferret
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Following the work of Gillick & Favre (2009), much work on extractive summarization has relied on modeling this task as two antagonistic constraints: one aims at maximizing the coverage of the produced summary with respect to the content of the original texts, while the other represents the size limit of the summary. In this approach, the notion of redundancy is only taken into account implicitly. In this article, we adopt the framework defined by Gillick & Favre (2009) but examine how, and to what extent, explicitly taking into account the semantic similarity of sentences can improve the performance of a multi-document summarization system. We verify this impact through evaluations carried out on the DUC 2003 and 2004 corpora.

LIMSI-COT at SemEval-2016 Task 12: Temporal relation identification using a pipeline of classifiers
Julien Tourille | Olivier Ferret | Aurélie Névéol | Xavier Tannier
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Désambiguïsation d’entités pour l’induction non supervisée de schémas événementiels
Kiem-Hieu Nguyen | Xavier Tannier | Olivier Ferret | Romaric Besançon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents a generative model for the unsupervised induction of events. Previous methods in the literature use only phrase heads to represent entities. However, the full phrase (for example, "an armed man") carries more discriminative information (than "man" alone). Our model takes this information into account and represents it in the distribution of event schemas. We show that these relations play an important role in parameter estimation and that they lead to more coherent and more discriminative distributions. Experimental results on the MUC-4 corpus confirm these improvements.

Déclasser les voisins non sémantiques pour améliorer les thésaurus distributionnels
Olivier Ferret
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Most methods for improving distributional thesauri focus on the means, whether representations or similarity measures, of better detecting the semantic similarity between words. In this article, we take the opposite point of view: we seek to detect, among the semantic neighbors associated with an entry, those least likely to be semantically related to it, and we use this information to rerank those neighbors. To detect the false semantic neighbors of an entry, we adopt an approach inspired by word sense disambiguation, building a classifier that differentiates this entry from other words in context. This classifier is then applied to a sample of the occurrences of the entry's neighbors to identify those furthest from the entry. We evaluate this method for thesauri built from syntactic co-occurrences and show the interest of combining it with the methods described in (Ferret, 2013b) using a voting strategy.

Generative Event Schema Induction with Entity Disambiguation
Kiem-Hieu Nguyen | Xavier Tannier | Olivier Ferret | Romaric Besançon
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Early and Late Combinations of Criteria for Reranking Distributional Thesauri
Olivier Ferret
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

Using a generic neural model for lexical substitution (Utiliser un modèle neuronal générique pour la substitution lexicale) [in French]
Olivier Ferret
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

Improving distributional thesauri by exploring the graph of neighbors
Vincent Claveau | Ewa Kijak | Olivier Ferret
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Event Role Extraction using Domain-Relevant Word Representations
Emanuela Boroş | Romaric Besançon | Olivier Ferret | Brigitte Grau
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Event Role Labelling using a Neural Network Model (Étiquetage en rôles événementiels fondé sur l’utilisation d’un modèle neuronal) [in French]
Emanuela Boroş | Romaric Besançon | Olivier Ferret | Brigitte Grau
Proceedings of TALN 2014 (Volume 1: Long Papers)

Exploring the neighbor graph to improve distributional thesauri (Explorer le graphe de voisinage pour améliorer les thésaurus distributionnels) [in French]
Vincent Claveau | Ewa Kijak | Olivier Ferret
Proceedings of TALN 2014 (Volume 1: Long Papers)

Evaluation of different strategies for domain adaptation in opinion mining
Anne Garcia-Fernandez | Olivier Ferret | Marco Dinarelli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The work presented in this article takes place in the field of opinion mining and aims more particularly at finding the polarity of a text by relying on machine learning methods. In this context, it focuses on studying various strategies for adapting a statistical classifier to a new domain when training data only exist for one or several other domains. This study shows more precisely that a self-training procedure, consisting in enlarging the initial training corpus with texts from the target domain that were reliably classified by the classifier, is the most successful and stable strategy for the tested domains. Moreover, this strategy achieves better results in most cases than the method of Blitzer et al. (2007) on the same evaluation corpus, while being simpler.
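The self-training procedure described above can be sketched generically; the `train`/`predict` function interface, the confidence threshold, and the fixed round limit are assumptions for illustration, not the paper's exact protocol:

```python
def self_train(train, predict, X, y, target_pool, threshold=0.9, rounds=3):
    """Self-training for domain adaptation: repeatedly train a classifier on
    the current data, label the target-domain pool, and move the confidently
    labeled items into the training set before retraining.
    train(X, y) -> model; predict(model, x) -> (label, confidence)."""
    X, y = list(X), list(y)
    pool = list(target_pool)
    for _ in range(rounds):
        model = train(X, y)
        scored = [(x,) + predict(model, x) for x in pool]
        confident = [(x, lab) for x, lab, conf in scored if conf >= threshold]
        if not confident:
            break                      # nothing reliable left to add
        for x, lab in confident:
            X.append(x)
            y.append(lab)
        pool = [x for x, lab, conf in scored if conf < threshold]
    return train(X, y)
```

Any classifier exposing a training function and a confidence-scored prediction function can be plugged in; the target-domain texts never need gold labels.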

Compounds and distributional thesauri
Olivier Ferret
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The building of distributional thesauri from corpora is a problem that has been the focus of a significant number of articles, starting with (Grefenstette, 1994) and followed by (Lin, 1998), (Curran and Moens, 2002) or (Heylen and Peirsman, 2007). However, in all these cases, only single terms were considered. More recently, the topic of compositionality in the framework of distributional semantic representations has come to the surface and has been investigated for building the semantic representation of phrases or even sentences from the representation of their words. However, this work had not been done until now with the objective of building distributional thesauri. In this article, we investigate the impact of introducing compounds into this building process. More precisely, we consider compounds as undividable lexical units and evaluate their influence in three different roles: as features in the distributional contexts of single terms, as possible neighbors of single-term entries and, finally, as entries of a thesaurus. This investigation was conducted through an intrinsic evaluation for a large set of nominal English single terms and compounds with various frequencies.

2013

Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
Olivier Ferret
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Unsupervised selection of semantic relations for improving a distributional thesaurus (Sélection non supervisée de relations sémantiques pour améliorer un thésaurus distributionnel) [in French]
Olivier Ferret
Proceedings of TALN 2013 (Volume 1: Long Papers)

pdf bib
Semantic relation clustering for unsupervised information extraction (Regroupement sémantique de relations pour l’extraction d’information non supervisée) [in French]
Wei Wang | Romaric Besançon | Olivier Ferret | Brigitte Grau
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

pdf bib
Une méthode d’extraction d’information fondée sur les graphes pour le remplissage de formulaires (A Graph-Based Method for Template Filling in Information Extraction) [in French]
Ludovic Jean-Louis | Romaric Besançon | Olivier Ferret
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Etude de différentes stratégies d’adaptation à un nouveau domaine en fouille d’opinion (Study of various strategies for adapting an opinion classifier to a new domain) [in French]
Anne Garcia-Fernandez | Olivier Ferret
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Evaluation of Unsupervised Information Extraction
Wei Wang | Romaric Besançon | Olivier Ferret | Brigitte Grau
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Unsupervised methods are attracting more and more attention in the field of information extraction, as they make it possible to design more open extraction systems. In the domain of unsupervised information extraction, clustering methods are of particular importance. However, evaluating the results of clustering remains difficult at a large scale, especially in the absence of a reliable reference. On the basis of our experiments on unsupervised relation extraction, we first discuss in this article how to evaluate clustering quality without a reference by relying on internal measures. Then we propose a method, supported by a dedicated annotation tool, for building a set of reference clusters of relations from a corpus. Moreover, we apply it to our experimental framework and illustrate in this way how to build, in a short time, a significant reference for unsupervised relation extraction, more precisely made of 80 clusters gathering more than 4,000 relation instances. Finally, we present how such a reference is exploited for the evaluation of clustering with external measures and analyze the results of applying these measures to the clusters of relations produced by our unsupervised relation extraction system.
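As an illustration of scoring a clustering without a reference, here is a sketch of one common internal measure, the silhouette coefficient, which balances cluster cohesion against separation. The abstract does not name the internal measures actually used, so this particular choice is an assumption.

```python
# Hedged sketch: an internal clustering measure (silhouette coefficient).
# For each point, a = mean distance to its own cluster, b = mean distance
# to the nearest other cluster; the score (b - a) / max(a, b) is averaged.
import numpy as np

def silhouette(points, labels):
    points, labels = np.asarray(points, float), np.asarray(labels)
    # pairwise Euclidean distances
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():
            continue               # singleton cluster: skipped
        a = d[i][same].mean()      # cohesion: mean distance within own cluster
        b = min(d[i][labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping clusters.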

pdf bib
Evaluation of a Complex Information Extraction Application in Specific Domain
Romaric Besançon | Olivier Ferret | Ludovic Jean-Louis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Operational intelligence applications in specific domains are developed using numerous natural language processing technologies and tools. A challenge for this integration is to take into account the limitations of each of these technologies in the global evaluation of the application. We present in this article a complex intelligence application for gathering information from the Web about recent seismic events. We present the different components needed for the development of such a system, including Information Extraction, Filtering and Clustering, and the technologies behind each component. We also propose an independent evaluation of each component and an insight into their influence on the overall performance of the system.

2011

pdf bib
Text Segmentation and Graph-based Method for Template Filling in Information Extraction
Ludovic Jean-Louis | Romaric Besançon | Olivier Ferret
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
A Corpus for Studying Full Answer Justification
Arnaud Grappy | Brigitte Grau | Olivier Ferret | Cyril Grouin | Véronique Moriceau | Isabelle Robba | Xavier Tannier | Anne Vilnat | Vincent Barbier
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered as reliable by users, a QA system must provide elements to evaluate the answer. This notion of answer justification can also be useful when developing a QA system in order to give criteria for selecting correct answers. An answer justification can be found in a sentence, a passage made of several consecutive sentences or several passages of a document or several documents. Thus, we are interested in pinpointing the set of information that makes it possible to verify the correctness of the answer in a candidate passage, along with the question elements that are missing from this passage. Moreover, the relevant information is often expressed in texts in a form different from that of the question: anaphora, paraphrases, synonyms. In order to have a better idea of the importance of all the phenomena we underlined, and to provide enough examples at the QA developer's disposal to study them, we decided to build an annotated corpus.

pdf bib
LIMA : A Multilingual Framework for Linguistic Analysis and Linguistic Resources Development and Evaluation
Romaric Besançon | Gaël de Chalendar | Olivier Ferret | Faiza Gara | Olivier Mesnard | Meriama Laïb | Nasredine Semmar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The increasing amount of available textual information makes the use of Natural Language Processing (NLP) tools necessary. These tools have to be used on large collections of documents in different languages. But NLP is a complex task that relies on many processes and resources. As a consequence, NLP tools must be both configurable and efficient: specific software architectures must be designed for this purpose. We present in this paper the LIMA multilingual analysis platform, developed at CEA LIST. This configurable platform has been designed to develop NLP-based industrial applications while keeping enough flexibility to integrate various processes and resources. This design makes LIMA a linguistic analyzer that can handle languages as different as French, English, German, Arabic or Chinese. Beyond its architecture principles and its capabilities as a linguistic analyzer, LIMA also offers a set of tools dedicated to the testing and evaluation of linguistic modules and to the production and management of new linguistic resources.

pdf bib
Testing Semantic Similarity Measures for Extracting Synonyms from a Corpus
Olivier Ferret
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The definition of lexical semantic similarity measures has been the subject of many studies over the years. In this article, we focus more specifically on distributional semantic similarity measures. Although several evaluations of this kind of measure have already been carried out to determine whether they actually capture semantic relatedness, it is still difficult to determine if a measure that performs well in an evaluation framework can be applied more widely with the same success. In the work we present here, we first select a semantic similarity measure by testing a large set of such measures against the WordNet-based Synonymy Test, an extended TOEFL test proposed in (Freitag et al., 2005), and we show that its accuracy is comparable to that of the best state-of-the-art measures while having less demanding requirements. Then, we apply this measure to automatically extract synonyms from a corpus and evaluate the relevance of this process against two reference resources, WordNet and the Moby thesaurus. Finally, we compare our results in detail to those of (Curran and Moens, 2002).
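A generic sketch of this kind of corpus-based synonym extraction: cooccurrence counts within a window give each word a context vector, and candidate synonyms of a target word are ranked by cosine similarity. The window size and the use of cosine here are illustrative defaults, not the measure actually selected in the paper.

```python
# Hedged sketch: distributional neighbours from windowed cooccurrences.
from collections import Counter, defaultdict
import math

def context_vectors(tokens, window=2):
    # each word's vector = counts of words seen within +/- window positions
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

def cosine(u, v):
    shared = set(u) & set(v)
    num = sum(u[w] * v[w] for w in shared)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def nearest(word, vecs, k=3):
    # rank all other words by similarity to the target word's vector
    return sorted(((cosine(vecs[word], v), w)
                   for w, v in vecs.items() if w != word),
                  reverse=True)[:k]
```

On realistic corpora, raw counts would typically be replaced by a weighting scheme such as pointwise mutual information before computing similarities.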

2009

pdf bib
Improving Text Segmentation by Combining Endogenous and Exogenous Methods
Olivier Ferret
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Learning Patterns for Building Resources about Semantic Relations in the Medical Domain
Mehdi Embarek | Olivier Ferret
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this article, we present a method for automatically extracting semantic relations from texts in the medical domain using linguistic patterns. These patterns refer to three levels of information about words: inflected form, lemma and part-of-speech. The method we present consists first in identifying the entities that are part of the relations to extract, that is to say diseases, exams, treatments, drugs or symptoms. Thereafter, sentences that contain pairs of entities are extracted and the presence of a semantic relation is validated by applying linguistic patterns. These patterns were previously learnt automatically from a manually annotated corpus by relying on an algorithm based on the edit distance. We first report the results of an evaluation of our medical entity tagger for the five types of entities mentioned above and then, more globally, the results of an evaluation of our extraction method for four relations between these entities. Both evaluations were done for French.
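The edit distance underlying the pattern-learning step can be illustrated by its classic dynamic-programming formulation, here applied to token sequences. This is a generic sketch of the distance itself, not the paper's exact pattern-generalization algorithm.

```python
# Hedged sketch: Levenshtein edit distance over token sequences,
# counting insertions, deletions and substitutions of whole tokens.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all tokens of a
    for j in range(n + 1):
        d[0][j] = j                      # insert all tokens of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]
```

Two near-identical sentences expressing the same relation thus get a small distance, which is the signal such a learning algorithm can exploit to group them under one pattern.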

2007

pdf bib
Finding document topics for improving topic segmentation
Olivier Ferret
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf bib
Enhancing Electronic Dictionaries with an Index Based on Associations
Olivier Ferret | Michael Zock
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Building a network of topical relations from a corpus
Olivier Ferret
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Lexical networks such as WordNet are known to lack topical relations, although these relations are very useful for tasks such as text summarization or information extraction. In this article, we present a method for automatically building from a large corpus a lexical network whose relations are preferably topical ones. As it does not rely on resources such as dictionaries, this method is based on self-bootstrapping: a network of lexical cooccurrences is first built from a corpus and then filtered by using the words of the corpus that are selected by the initial network. We report an evaluation on topic segmentation showing that the results obtained with the filtered network are the same as those obtained with the initial network, although the former is significantly smaller than the latter.

2004

pdf bib
Discovering word senses from a network of lexical cooccurrences
Olivier Ferret
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2002

pdf bib
Building domain specific lexical hierarchies from corpora
Olivier Ferret | Christian Fluhr | Françoise Rousseau-Hans | Jean-Luc Simoni
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Using Collocations for Topic Segmentation and Link Detection
Olivier Ferret
COLING 2002: The 19th International Conference on Computational Linguistics

2001

pdf bib
A Cross-Comparison of Two Clustering Methods
Michele Jardino | Brigitte Grau | Olivier Ferret
Proceedings of the ACL 2001 Workshop on Evaluation Methodologies for Language and Dialogue Systems

pdf bib
Terminological Variants for Document Selection and Question/Answer Matching
Olivier Ferret | Brigitte Grau | Martine Hurault-Plantet | Gabriel Illouz | Christian Jacquemin
Proceedings of the ACL 2001 Workshop on Open-Domain Question Answering

1998

pdf bib
Thematic segmentation of texts: two methods for two kinds of texts
Olivier Ferret | Brigitte Grau | Nicolas Masson
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf bib
How to thematically segment texts by using lexical cohesion?
Olivier Ferret
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
Thematic Segmentation of Texts: Two Methods for Two Kind of Texts
Olivier Ferret | Brigitte Grau | Nicolas Masson
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
How to Thematically Segment Texts by using Lexical Cohesion?
Olivier Ferret
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2