Apprendre de la littérature scientifique : Les réseaux de signalisation en biologie systémique (Literature-based discovery: Signaling Systems in Systemic Biology)
Flavie Landomiel | Cathy Guérineau | Anubhav Gupta | Denis Maurel | Anne Poupon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

This paper aims to show the feasibility of a text-mining system that feeds an inference engine able to build a signaling network in systems biology from predicates extracted from scientific articles. The mining proceeds in two steps: retrieving sentences of interest from a large scientific corpus, then automatically constructing predicates. Both steps rely on a system of transducer cascades.


Covering various Needs in Temporal Annotation: a Proposal of Extension of ISO TimeML that Preserves Upward Compatibility
Anaïs Lefeuvre-Halftermeyer | Jean-Yves Antoine | Alain Couillault | Emmanuel Schang | Lotfi Abouda | Agata Savary | Denis Maurel | Iris Eshkol | Delphine Battistelli
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper reports a critical analysis of the ISO TimeML standard, in the light of several experiments in temporal annotation conducted on spoken French. It shows that the standard suffers from weaknesses that should be corrected to fit a larger variety of needs in NLP and in corpus linguistics. We present our proposal for improvements to the standard before it is revised by the ISO Committee in 2017. These modifications mainly concern (1) enrichments of well-identified features of the standard: the temporal function of TIMEX time expressions and additional types for TLINK temporal relations; (2) deeper modifications concerning the units or features annotated: clarification of the distinction between time and tense for EVENT units, and a coherent representation of temporal signals (the SIGNAL unit) and TIMEX modifiers (the MOD feature); (3) a recommendation to perform temporal annotation on top of a syntactic (rather than lexical) layer, i.e. temporal annotation on a treebank.

Estimer la notoriété d’un nom propre via Wikipedia (Estimate the notoriety of a Proper name using Wikipedia)
Mouna Elashter | Denis Maurel
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

This paper proposes to compute, using Wikipedia, a notoriety index for the entries of Prolexbase, a multilingual relational dictionary of proper names. This notoriety index is language-dependent; it will contribute, on the one hand, to building a Prolexbase module for Arabic and, on the other hand, to revising the notoriety values currently recorded for the other languages in the database. To compute notoriety, we apply the SAW method (preceded by a Shannon entropy computation) to five numerical values derived from Wikipedia.
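
As an illustration of the kind of computation this abstract mentions, here is a minimal sketch, assuming that SAW denotes simple additive weighting and that the Shannon entropy step is used to derive the criterion weights. The abstract does not name the five Wikipedia values, so the criteria and all numbers below are invented placeholders, and the function names `entropy_weights` and `saw_scores` are hypothetical; this is not the authors' implementation.

```python
import math

def entropy_weights(matrix):
    """Derive one weight per criterion (column) from Shannon entropy.

    Assumes at least two rows and strictly positive values.
    A column whose values differ more across entries gets a larger weight.
    """
    m = len(matrix)
    n = len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        # Normalised Shannon entropy of the column, in [0, 1].
        entropy = -sum((v / total) * math.log(v / total) for v in column)
        entropy /= math.log(m)
        divergences.append(1.0 - entropy)
    total_divergence = sum(divergences)
    if total_divergence == 0:
        # Every column is uniform: fall back to equal weights.
        return [1.0 / n] * n
    return [d / total_divergence for d in divergences]

def saw_scores(matrix, weights):
    """Simple additive weighting: weighted sum of max-normalised values."""
    maxima = [max(row[j] for row in matrix) for j in range(len(weights))]
    return [
        sum(w * row[j] / maxima[j] for j, w in enumerate(weights))
        for row in matrix
    ]

# Invented toy data: three entries, five placeholder criteria each.
# The real criteria used in the paper are not given in the abstract.
toy_matrix = [
    [120.0, 15.0, 300.0, 40.0, 8.0],
    [30.0, 5.0, 90.0, 10.0, 2.0],
    [60.0, 15.0, 150.0, 20.0, 4.0],
]
toy_weights = entropy_weights(toy_matrix)
toy_scores = saw_scores(toy_matrix, toy_weights)
```

With max-normalisation, an entry that holds the highest value on every criterion scores 1, so the first toy entry ranks first here.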


Arabic Named Entity Recognition Process using Transducer Cascade and Arabic Wikipedia
Fatma Ben Mesmia | Kais Haddar | Denis Maurel | Nathalie Friburger
Proceedings of the International Conference Recent Advances in Natural Language Processing


Tense and Time Annotations : a Contribution to TimeML Improvement (Annotation de la temporalité en corpus : contribution à l’amélioration de la norme TimeML) [in French]
Anaïs Lefeuvre | Jean-Yves Antoine | Agata Savary | Emmanuel Schang | Lotfi Abouda | Denis Maurel | Iris Eshkol
Proceedings of TALN 2014 (Volume 2: Short Papers)

ANCOR_Centre, a large free spoken French coreference corpus: description of the resource and reliability measures
Judith Muzerelle | Anaïs Lefeuvre | Emmanuel Schang | Jean-Yves Antoine | Aurore Pelletier | Denis Maurel | Iris Eshkol | Jeanne Villaneau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents ANCOR_Centre, a French coreference corpus available under a Creative Commons licence. With a size of around 500,000 words, the corpus is large enough to serve the needs of data-driven approaches in NLP and represents one of the largest coreference resources currently available. The corpus focuses exclusively on spoken language and aims to represent a variety of spoken genres. ANCOR_Centre includes anaphora as well as coreference relations involving nominal and pronominal mentions. The paper describes in detail the annotation scheme and the reliability measures computed on the resource.


ProLMF 1.2, Proper Names with their Expansions (ProLMF version 1.2. Une ressource libre de noms propres avec des expansions contextuelles) [in French]
Denis Maurel | Béatrice Bouchou Markhoff
Proceedings of TALN 2013 (Volume 2: Short Papers)

ANCOR, the first large French speaking corpus of conversational speech annotated in coreference to be freely available (ANCOR, premier corpus de français parlé d’envergure annoté en coréférence et distribué librement) [in French]
Judith Muzerelle | Anaïs Lefeuvre | Jean-Yves Antoine | Emmanuel Schang | Denis Maurel | Jeanne Villaneau | Iris Eshkol
Proceedings of TALN 2013 (Volume 2: Short Papers)

CasSys, a free transducer cascade system (CasSys Un système libre de cascades de transducteurs) [in French]
Denis Maurel | Nathalie Friburger
Proceedings of TALN 2013 (Volume 3: System Demonstrations)


A tagged and aligned corpus for the study of Proper Names in translation
Emeline Lecuit | Denis Maurel | Duško Vitas
Proceedings of The Second Workshop on Annotation and Exploitation of Parallel Corpora


Eslo: From Transcription to Speakers’ Personal Information Annotation
Iris Eshkol | Denis Maurel | Nathalie Friburger
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the preliminary work needed to put a French oral corpus and its transcription online. This corpus is the Socio-Linguistic Survey in Orleans, carried out in 1968. First, we digitized the corpus, then transcribed it manually with the Transcriber software, adding tags about speakers, time, noise, etc. Each document (audio file and XML file of the transcription) was described by a set of metadata stored in an XML format to allow easy consultation. Second, we added different levels of annotation: recognition of named entities and annotation of personal information about speakers. These two annotation tasks used the CasSys system of transducer cascades. We used and modified a first cascade to recognize named entities. Then we built a second cascade to annotate the designating entities, i.e. information about the speaker. This second cascade parsed the corpus annotated with named entities. The objective is to locate information about the speaker and also to determine what kind of information can designate him or her. These two cascades were evaluated with precision and recall measures.

An Analysis of the Performances of the CasEN Named Entities Recognition System in the Ester2 Evaluation Campaign
Damien Nouvel | Jean-Yves Antoine | Nathalie Friburger | Denis Maurel
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a detailed and critical analysis of the behaviour of the CasEN named entity recognition system during the French Ester2 evaluation campaign. In this campaign, CasEN was faced with the task of detecting and categorizing named entities in manual and automatic transcriptions of radio broadcasts. First, we give a general presentation of the Ester2 campaign. Then, we describe our system, which is based on transducers. Next, we describe how systems were evaluated during the campaign and report the main official results. Afterwards, we investigate in detail the influence of some annotation biases that significantly affected the estimation of system performance. Finally, we conduct an in-depth analysis of the actual errors of the CasEN system, which gives useful indications about the phenomena that caused them (e.g. metonymy, encapsulation, detection of right boundaries) and that remain challenges for named entity recognition systems.


Prolexbase: a Multilingual Relational Lexical Database of Proper Names
Denis Maurel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper deals with Prolexbase, a multilingual relational lexical database of proper names, a free resource available on the CNRTL website. The Prolex model is based on two main concepts: first, a language-independent pivot and, second, the prolexeme (the projection of the pivot onto a particular language), which is a set of lemmas (names and derivatives). These two concepts model the variations of proper names: first, those independent of language and, second, those that depend on the language through morphology or knowledge. Variation processing is very important for NLP: the same proper name can occur in different written forms, possibly in different parts of speech, and it can also be replaced by another one, a lexical anaphora (which reveals a semantic link). The pivot represents the different points of view on a referent, i.e. language-independent variations of a name. Pivots are linked by three semantic relations (quasi-synonymy, partitive relation and associative relation). The prolexeme is a set of variants (aliases), quasi-synonyms and morphosemantic derivatives. Prolexemes are linked to classifying contexts and a reliability code.
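
The two-level model described in this abstract can be pictured with a small data-structure sketch. Everything below is an illustrative assumption: the class and field names (`Pivot`, `Prolexeme`, `surface_forms`, and so on) are hypothetical and do not reflect the actual Prolexbase schema; only the three relation names and the pivot/prolexeme split come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# The three semantic relations between pivots named in the abstract.
RELATIONS = ("quasi-synonymy", "partitive", "associative")

@dataclass
class Pivot:
    """Language-independent node (hypothetical layout)."""
    pivot_id: int
    # Relation name -> ids of the pivots it points to.
    relations: Dict[str, Set[int]] = field(default_factory=dict)

    def link(self, relation: str, other: "Pivot") -> None:
        """Record a semantic relation from this pivot to another one."""
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.relations.setdefault(relation, set()).add(other.pivot_id)

@dataclass
class Prolexeme:
    """Projection of a pivot onto one language (hypothetical layout)."""
    pivot_id: int
    language: str
    lemma: str
    aliases: Set[str] = field(default_factory=set)
    quasi_synonyms: Set[str] = field(default_factory=set)
    derivatives: Set[str] = field(default_factory=set)

    def surface_forms(self) -> Set[str]:
        """All forms under which the name may appear in this language."""
        return {self.lemma} | self.aliases | self.quasi_synonyms | self.derivatives

# Toy example: one pivot for the city of Paris, projected onto two languages.
paris = Pivot(pivot_id=1)
france = Pivot(pivot_id=2)
paris.link("partitive", france)  # Paris is part of France

paris_fr = Prolexeme(pivot_id=1, language="fr", lemma="Paris", derivatives={"parisien"})
paris_en = Prolexeme(pivot_id=1, language="en", lemma="Paris", derivatives={"Parisian"})
```

Keeping the relations on the pivot rather than on each prolexeme mirrors the abstract's point that these links are language-independent, while the derivatives differ per language.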