Cyril Grouin


2020

Inference Annotation of a Chinese Corpus for Opinion Mining
Liyun Yan | Danni E | Mei Gan | Cyril Grouin | Mathieu Valette
Proceedings of the 12th Language Resources and Evaluation Conference

Polarity classification (positive, negative or neutral opinion detection) is well developed in the field of opinion mining. However, existing tools, which perform with high accuracy on short sentences and explicit expressions, have limited success interpreting narrative phrases and inference contexts. In this article, we discuss an important aspect of opinion mining: inference. We give our definition of inference, classify its different types, provide an annotation framework, and analyze the annotation results. While inferences are often studied in the field of natural language understanding (NLU), we propose to examine inference as it relates to opinion mining. First, based on linguistic analysis, we clarify what kinds of sentences contain an inference. We define five types of inference: logical, pragmatic, lexical, enunciative, and discursive. Second, we explain our annotation framework, which covers both inference detection and opinion mining. In short, this manual annotation determines whether or not a target contains an inference; if so, we then define its inference type, polarity, and topic. Using the results of this annotation, we observed several correlations that will be used to determine distinctive features for automatic inference classification in further research. We also report the results of three preliminary classification experiments.

Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Atelier DÉfi Fouille de Textes
Rémi Cardon | Natalia Grabar | Cyril Grouin | Thierry Hamon
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Atelier DÉfi Fouille de Textes

Présentation de la campagne d’évaluation DEFT 2020 : similarité textuelle en domaine ouvert et extraction d’information précise dans des cas cliniques (Presentation of the DEFT 2020 Challenge : open domain textual similarity and precise information extraction from clinical cases )
Rémi Cardon | Natalia Grabar | Cyril Grouin | Thierry Hamon
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Atelier DÉfi Fouille de Textes

The 2020 edition of the DEFT text mining challenge (Défi Fouille de Textes) proposed two tasks on textual similarity and one information extraction task. The first task aims to identify the degree of similarity between sentence pairs on a scale from 0 (least similar) to 5 (most similar). Results range from 0.65 to 0.82 EDRM. The second task consists of determining which of three provided target sentences is closest to a source sentence, with very high results ranging from 0.94 to 0.99 accuracy. Both tasks rely on a corpus from the general and health domains. The third task is to extract ten categories of medical information from the DEFT 2019 corpus of clinical cases. Results range from 0.07 to 0.66 overall F-measure for the subtask on pathologies and signs or symptoms, and from 0.14 to 0.76 for the subtask on eight medical categories. The methods used rely on CRFs and neural networks.

2019

Corpus annoté de cas cliniques en français (Annotated corpus with clinical cases in French)
Natalia Grabar | Cyril Grouin | Thierry Hamon | Vincent Claveau
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

Textual corpora are useful for various natural language processing (NLP) applications, as they provide the data needed to build, adapt, or evaluate systems. However, in some domains, such as the medical domain, access to data is difficult, if not impossible, for reasons of confidentiality and ethics. There is nevertheless a real need for clinical corpora for teaching and research. To address this challenge, we present in this article the CAS corpus of clinical cases, real or fictitious, that we have compiled. These clinical cases in French cover several medical specialties and thus focus on different clinical situations. The corpus currently contains 4,300 cases (about 1.5M word occurrences). It comes with additional information (discussions of the clinical cases, keywords, etc.) and with annotations we produced in view of the needs of NLP research in this domain. We also present the results of first information retrieval and information extraction experiments performed on this annotated corpus. These experiments can provide a baseline for other researchers wishing to work with the data.

Recherche et extraction d’information dans des cas cliniques. Présentation de la campagne d’évaluation DEFT 2019 (Information Retrieval and Information Extraction from Clinical Cases)
Natalia Grabar | Cyril Grouin | Thierry Hamon | Vincent Claveau
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Défi Fouille de Textes (atelier TALN-RECITAL)

This article presents the DEFT 2019 evaluation campaign on the analysis of clinical texts written in French. The corpus consists of clinical cases published and discussed in scientific articles and indexed with keywords. We propose three independent tasks: indexing of clinical cases and discussions, primarily evaluated with MAP (mean average precision); matching clinical cases with discussions, evaluated with precision; and information extraction over four categories (age, gender, origin of the consultation, outcome), evaluated in terms of recall, precision, and F-measure. We present the results obtained by the participants on each task.

Clinical Case Reports for NLP
Cyril Grouin | Natalia Grabar | Vincent Claveau | Thierry Hamon
Proceedings of the 18th BioNLP Workshop and Shared Task

Textual data are useful for accessing expert information. Yet, since texts are representative of distinct language uses, it is necessary to build specific corpora in order to design suitable NLP tools. In some domains, such as the medical domain, it may be complicated to access representative textual data and their semantic annotations, while there is a real need for efficient tools and methods. Our paper presents a corpus of clinical cases written in French, together with their semantic annotations. We manually annotated a set of 717 files into four general categories (age, gender, outcome, and origin), for a total of 2,835 annotations. The values of age, gender, and outcome are normalized. A subset of 70 files has additionally been manually annotated into 27 categories, for a total of 5,198 annotations.

Community Perspective on Replicability in Natural Language Processing
Margot Mieskes | Karën Fort | Aurélie Névéol | Cyril Grouin | Kevin Cohen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises of how the NLP community views the topic of replicability in general. Using a survey involving members of the NLP community, we investigate how our community perceives this topic, its relevance, and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations that successful reproducibility requires more than access to code and data. Additionally, the results show that the topic has to be tackled from the authors', reviewers', and community's sides.

2018

Three Dimensions of Reproducibility in Natural Language Processing
K. Bretonnel Cohen | Jingbo Xia | Pierre Zweigenbaum | Tiffany Callahan | Orin Hargraves | Foster Goss | Nancy Ide | Aurélie Névéol | Cyril Grouin | Lawrence E. Hunter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Simplification de schémas d’annotation : un aller sans retour ? (Annotation scheme simplification : a one way trip with no return ?)
Cyril Grouin
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

In this article, we compare the impact of simplifying an annotation scheme on a named entity recognition (NER) system. One simplification consists of grouping the named entity (NE) types under two generic types (person and location); the other amounts to defining each NE type more precisely. We observe improved results on both simplified versions. We also study the possibility of recovering the level of detail of the NE types of the original scheme from the simplified versions. Using conversion rules makes it possible to recover the original NE types, but a form of contextual ambiguity remains that cannot be resolved by rules.

DEFT2018 : recherche d’information et analyse de sentiments dans des tweets concernant les transports en Île de France (DEFT2018 : Information Retrieval and Sentiment Analysis in Tweets about Public Transportation in Île de France Region )
Patrick Paroubek | Cyril Grouin | Patrice Bellot | Vincent Claveau | Iris Eshkol-Taravella | Amel Fraisse | Agata Jackiewicz | Jihen Karoui | Laura Monceaux | Juan-Manuel Torres-Moreno
Actes de la Conférence TALN. Volume 2 - Démonstrations, articles des Rencontres Jeunes Chercheurs, ateliers DeFT

This article presents the 2018 edition of the DEFT evaluation campaign (Défi Fouille de Textes). Starting from a corpus of tweets, four tasks were proposed: identifying tweets on the topic of transportation; then, among those, identifying the polarity (negative, neutral, positive, mixed); identifying sentiment markers and their target; and finally, fully annotating each tweet with the source and target of the expressed sentiments. Twelve teams participated, mostly in the first two tasks. On transportation topic identification, micro F-measure ranges from 0.827 to 0.908. On global polarity identification, micro F-measure ranges from 0.381 to 0.823.

2017

Traitement automatique de la langue biomédicale au LIMSI (Biomedical language processing at LIMSI)
Christopher Norman | Cyril Grouin | Thomas Lavergne | Aurélie Névéol | Pierre Zweigenbaum
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

We propose demonstrations of three tools developed by LIMSI for natural language processing applied to the biomedical domain: detection of medical concepts in short texts, categorization of scientific articles to assist in the writing of systematic reviews, and de-identification of clinical texts.

Generating a Training Corpus for OCR Post-Correction Using Encoder-Decoder Model
Eva D’hondt | Cyril Grouin | Brigitte Grau
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of (relatively) clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
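As a rough, hedged illustration of the character-level language-modeling idea (not the authors' system, which uses biLSTMs), the sketch below ranks OCR correction candidates with an add-alpha smoothed character bigram model; the training sentences and candidate list are invented:

```python
import math
from collections import Counter

def train_char_bigram_lm(texts, alpha=1.0):
    """Add-alpha smoothed character bigram model trained on clean text."""
    bigrams, unigrams = Counter(), Counter()
    for text in texts:
        padded = "^" + text + "$"          # sentence boundary markers
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab_size = len(unigrams) + 1
    def logprob(s):
        padded = "^" + s + "$"
        return sum(
            math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab_size))
            for a, b in zip(padded, padded[1:]))
    return logprob

def correct(candidates, lm):
    """Keep the candidate the language model finds most probable."""
    return max(candidates, key=lm)

lm = train_char_bigram_lm(["the patient was treated", "the treatment went well"])
best = correct(["tne", "the"], lm)  # 'tne' is an OCR-style confusion of 'the'
```

A real system would generate candidates from OCR confusion patterns and use a stronger model, but the ranking principle is the same.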

2016

Identification of Drug-Related Medical Conditions in Social Media
François Morlane-Hondère | Cyril Grouin | Pierre Zweigenbaum
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Monitoring social media has been shown to be an interesting approach for the early detection of drug adverse effects. In this paper, we describe a system which extracts medical entities from French drug reviews written by users. We focus on the identification of medical conditions, which is based on the concept of post-coordination: we first extract minimal medical-related entities (pain, stomach), then combine them to identify complex ones (It was the worst [pain I ever felt in my stomach]). These two steps are respectively performed by two classifiers, the first based on Conditional Random Fields and the second on Support Vector Machines. The overall results of the minimal entity classifier are: P=0.926; R=0.849; F1=0.886. A thorough analysis of the feature set shows that, when combined with word lemmas, clusters generated by word2vec are the most valuable features. When trained on the output of the first classifier, the second classifier's performances are: P=0.683; R=0.956; F1=0.797. The addition of post-processing rules did not bring any significant global improvement but was found to modify the precision/recall ratio.
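The post-coordination idea can be pictured with a toy rule in place of the SVM step: merge minimal entity spans that occur within a few tokens of each other into one complex mention. The span indices and gap threshold below are invented for the example:

```python
def combine_minimal_entities(minimal_spans, max_gap=6):
    """Merge minimal entity (start, end) token spans into one complex
    mention when consecutive spans are at most `max_gap` tokens apart.
    Rule-based stand-in for the SVM classifier described in the paper."""
    merged = []
    for start, end in sorted(minimal_spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)   # extend the current mention
        else:
            merged.append((start, end))
    return merged

tokens = "It was the worst pain I ever felt in my stomach".split()
# minimal entities: 'pain' (token 4) and 'stomach' (token 10)
complex_spans = combine_minimal_entities([(4, 4), (10, 10)])
```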

Text Segmentation of Digitized Clinical Texts
Cyril Grouin
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the experiments we made to recover the original two-column page layout from layout-damaged digitized files. We designed several CRF-based approaches, either to identify the column separator or to classify each token of each line into the left or right column. We achieved our best results with a model trained on homogeneous corpora (only files composed of two columns) when classifying each token into the left or right column (overall F-measure of 0.968). Our experiments show that it is possible to recover the original column layout of digitized documents with high-quality results.
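A heuristic caricature of the token-classification view (the paper's actual models are CRF-based) is to estimate the separator as the most frequent start position of the widest space run, then assign each token to a column by its character offset; the sample lines are invented:

```python
import re
from collections import Counter

def find_separator(lines):
    """Estimate the column separator as the most frequent start position
    of the widest run of spaces across lines."""
    positions = Counter()
    for line in lines:
        gaps = [(m.start(), m.end() - m.start()) for m in re.finditer(r"  +", line)]
        if gaps:
            positions[max(gaps, key=lambda g: g[1])[0]] += 1
    return positions.most_common(1)[0][0] if positions else None

def split_columns(line, sep):
    """Assign each token of a line to the left or right column by offset."""
    left, right = [], []
    for m in re.finditer(r"\S+", line):
        (left if m.start() < sep else right).append(m.group())
    return left, right

lines = ["Name: Smith      Date: 1890-12-01",
         "Age: 42          Ward: B"]
sep = find_separator(lines)
left, right = split_columns(lines[0], sep)
```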

Controlled Propagation of Concept Annotations in Textual Corpora
Cyril Grouin
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present the annotation propagation tool we designed to be used in conjunction with the BRAT rapid annotation tool. We designed two experiments to annotate a corpus of 60 files, first without our tool, then with it. We evaluated annotation time and annotation quality. We show that using the annotation propagation tool reduces the time spent annotating the corpus by 31.7% while improving the quality of the results.
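Stripped of its control step, propagation reduces to copying each annotated surface form onto every other exact occurrence in the corpus; the documents and label below are invented:

```python
import re

def propagate_annotations(documents, annotations):
    """Propagate (surface_string, label) pairs to every exact occurrence
    in every document; returns {doc_id: [(start, end, label), ...]}."""
    propagated = {doc_id: [] for doc_id in documents}
    for surface, label in annotations:
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b")
        for doc_id, text in documents.items():
            for m in pattern.finditer(text):
                propagated[doc_id].append((m.start(), m.end(), label))
    return propagated

docs = {"d1": "Severe headache reported. The headache lasted two days.",
        "d2": "No headache today."}
spans = propagate_annotations(docs, [("headache", "Symptom")])
```

In practice the "controlled" part matters: each propagated span is a suggestion the annotator confirms or rejects rather than a final annotation.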

Identification of Mentions and Relations between Bacteria and Biotope from PubMed Abstracts
Cyril Grouin
Proceedings of the 4th BioNLP Shared Task Workshop

A Dataset for ICD-10 Coding of Death Certificates: Creation and Usage
Thomas Lavergne | Aurélie Névéol | Aude Robert | Cyril Grouin | Grégoire Rey | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Very few datasets have been released for the evaluation of diagnosis coding with the International Classification of Diseases, and only one so far in a language other than English. This paper describes a large-scale dataset prepared from French death certificates, and the problems which needed to be solved to turn it into a dataset suitable for the application of machine learning and natural language processing methods of ICD-10 coding. The dataset includes the free-text statements written by medical doctors, the associated meta-data, the human coder-assigned codes for each statement, as well as the statement segments which supported the coder’s decision for each code. The dataset comprises 93,694 death certificates totalling 276,103 statements and 377,677 ICD-10 code assignments (3,457 unique codes). It was made available for an international automated coding shared task, which attracted five participating teams. An extended version of the dataset will be used in a new edition of the shared task.

Supervised classification of end-of-lines in clinical text with no manual annotation
Pierre Zweigenbaum | Cyril Grouin | Thomas Lavergne
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.
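A one-rule caricature of the line-length features (far simpler than the paper's self-training and co-training) labels an end-of-line as soft when the line is nearly as wide as the longest line and does not end a sentence; the sample lines and the 0.8 ratio are invented:

```python
def classify_line_breaks(lines, ratio=0.8):
    """Label each end-of-line as 'soft' (a wrapped line, i.e. a simple
    space) or 'hard' (a true text-unit boundary) with one length rule:
    a line nearly as wide as the longest line was probably wrapped."""
    max_len = max(len(line) for line in lines)
    labels = []
    for line in lines[:-1]:                      # the last line has no break
        wrapped = len(line) >= ratio * max_len and not line.rstrip().endswith(".")
        labels.append("soft" if wrapped else "hard")
    return labels

lines = ["The patient was admitted with chest pain and",
         "shortness of breath.",
         "Plan: follow-up in two weeks."]
labels = classify_line_breaks(lines)
# line 1 is wrapped mid-sentence -> 'soft'; line 2 ends a unit -> 'hard'
```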

Detection of Text Reuse in French Medical Corpora
Eva D’hondt | Cyril Grouin | Aurélie Névéol | Efstathios Stamatatos | Pierre Zweigenbaum
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Electronic Health Records (EHRs) are increasingly available in modern health care institutions, either through the direct creation of electronic documents in hospitals' health information systems or through the digitization of historical paper records. Both EHR creation methods call for sophisticated text reuse detection tools in order to prepare the EHR collections for efficient secondary use relying on Natural Language Processing methods. Herein, we address the detection of two types of text reuse in French EHRs: 1) the detection of updated versions of the same document and 2) the detection of document duplicates that still bear surface differences due to OCR or de-identification processing. We present a robust text reuse detection method to automatically identify redundant document pairs in two French EHR corpora; it achieves overall macro F-measures of 0.68 and 0.60, respectively, and correctly identifies all redundant document pairs of interest.
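A generic way to flag such redundant pairs (a common baseline, not necessarily the authors' exact method) is Jaccard similarity over character n-gram shingles; the documents and the 0.7 threshold are invented:

```python
def shingles(text, n=5):
    """Character n-gram shingles of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def redundant_pairs(docs, threshold=0.7):
    """Flag document pairs whose shingle similarity crosses a threshold."""
    ids = list(docs)
    sets = {i: shingles(docs[i]) for i in ids}
    return [(i, j) for k, i in enumerate(ids) for j in ids[k + 1:]
            if jaccard(sets[i], sets[j]) >= threshold]

docs = {"a": "Patient discharged after treatment for pneumonia.",
        "b": "Patient discharged after treatment for pneumonia .",
        "c": "Unrelated surgical report about knee replacement."}
pairs = redundant_pairs(docs)  # 'a' and 'b' differ only by OCR-like noise
```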

Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis
Cyril Grouin | Thierry Hamon | Aurélie Névéol | Pierre Zweigenbaum
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

Low-resource OCR error detection and correction in French Clinical Texts
Eva D’hondt | Cyril Grouin | Brigitte Grau
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

Replicability of Research in Biomedical Natural Language Processing: a pilot evaluation for a coding task
Aurélie Névéol | Kevin Cohen | Cyril Grouin | Aude Robert
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

Une catégorisation de fins de lignes non-supervisée (End-of-line classification with no supervision)
Pierre Zweigenbaum | Cyril Grouin | Thomas Lavergne
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In some raw texts, end-of-line marks may or may not mark the boundary of a text unit (typically a paragraph). This problem is likely to affect subsequent processing, but is rarely addressed in the literature. We propose a fully unsupervised method to determine whether an end-of-line should be seen as a simple space or as a true text-unit boundary, and test it on a corpus of medical reports. This method achieves an F-measure of 0.926 on a sample of 24 texts containing wrapped lines. Applied to a larger sample of texts that may or may not contain wrapped lines, our most cautious method achieves an F-measure of 0.898, a high value for a fully unsupervised method.

LIMSI at SemEval-2016 Task 12: machine-learning and temporal information to identify clinical events and time expressions
Cyril Grouin | Véronique Moriceau
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Identification de facteurs de risque pour des patients diabétiques à partir de comptes-rendus cliniques par des approches hybrides
Cyril Grouin | Véronique Moriceau | Sophie Rosset | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

In this article, we present the methods we developed to analyze hospital reports written in English. The goal of this study is to identify risk factors of death for diabetic patients and to position the medical events described relative to the creation date of each document. Our approach relies on (i) HeidelTime to identify temporal expressions, (ii) CRFs complemented with post-processing rules to identify treatments, diseases, and risk factors, and (iii) rules to temporally position each medical event. On a corpus of 514 documents, we achieve an overall F-measure of 0.8451. We observe that identifying information directly mentioned in the documents performs better than inferring information from laboratory results.

Étude des verbes introducteurs de noms de médicaments dans les forums de santé
François Morlane-Hondère | Cyril Grouin | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we combine manual and automatic annotation to identify the verbs used to introduce a drug name in posts on health forums. This information is useful in particular for identifying the relation between a drug and a side effect. The mention of a drug in a post does not guarantee that the user took the treatment, only that they are reporting on it. We then show that these verbs can be used to automatically extract variants of drug names. We believe that analyzing these variants could make it possible to model the errors forum users make when writing drug names, and consequently to improve information retrieval systems.

Médicaments qui soignent, médicaments qui rendent malades : étude des relations causales pour identifier les effets secondaires
François Morlane-Hondère | Cyril Grouin | Véronique Moriceau | Pierre Zweigenbaum
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

In this article, we examine how the links between a medical treatment and a side effect are expressed. Because patients turn first to the internet, we base this study on an annotated corpus of posts from French health forums. The goal of this work is to highlight linguistic cues (logical connectives and temporal expressions) that could be useful for automatic side-effect detection systems. We observe that the writing habits on forums do not allow temporal expressions to be relied upon. Logical connectives, however, appear useful for identifying side effects.

Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis
Cyril Grouin | Thierry Hamon | Aurélie Névéol | Pierre Zweigenbaum
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

Is it possible to recover personal health information from an automatically de-identified corpus of French EHRs?
Cyril Grouin | Nicolas Griffon | Aurélie Névéol
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

Optimizing annotation efforts to build reliable annotated corpora for training statistical models
Cyril Grouin | Thomas Lavergne | Aurélie Névéol
Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop

TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)
Cyril Grouin
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)

Automatic Analysis of Scientific and Literary Texts. Presentation and Results of the DEFT2014 Text Mining Challenge (Analyse automatique de textes littéraires et scientifiques : présentation et résultats du défi fouille de texte DEFT2014) [in French]
Thierry Hamon | Quentin Pleplé | Patrick Paroubek | Pierre Zweigenbaum | Cyril Grouin
TALN-RECITAL 2014 Workshop DEFT 2014 : DÉfi Fouille de Textes (DEFT 2014 Workshop: Text Mining Challenge)

Biomedical entity extraction using machine-learning based approaches
Cyril Grouin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present the experiments we made to process entities from the biomedical domain. Depending on the task, we used two distinct supervised machine-learning techniques: Conditional Random Fields to perform both named entity identification and classification, and Maximum Entropy to classify given entities. Machine-learning approaches outperformed knowledge-based techniques on categories where sufficient annotated data was available. We show that the use of external features (unsupervised clusters, information from ontologies and taxonomies) significantly improved the results.

Morpho-Syntactic Study of Errors from Speech Recognition System
Maria Goryainova | Cyril Grouin | Sophie Rosset | Ioana Vasilescu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This study offers an original standpoint on speech transcription errors by focusing on the morpho-syntactic features of erroneous chunks and of the surrounding left and right contexts. The typology concerns the forms, lemmas, and POS involved in erroneous chunks and in the surrounding contexts. Comparisons with error-free contexts are also provided. The study is conducted on French. The morpho-syntactic analysis underlines that three main classes are particularly represented in erroneous chunks: (i) grammatical words (to, of, the), (ii) auxiliary verbs (has, is), and (iii) modal verbs (should, must). Such items are widely encountered in ASR outputs as frequent candidates for transcription errors. The analysis of the context points out that some left 3-gram contexts (e.g., repetitions, that is, disfluencies, and bracketing formulas such as “c’est”) may be better predictors than others. Finally, the surface analysis, conducted through a Levenshtein distance analysis, highlighted that the most common distance is 2 characters and mainly involves differences between inflected forms of a single item.
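The Levenshtein distance used in the surface analysis is the classic dynamic-programming edit distance; the French example pair below is invented to show a distance of 2 between inflected forms of one item:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute),
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Two inflected forms of a hypothetical item differ by two characters,
# the most common distance reported in the study.
d = levenshtein("mangé", "mangées")
```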

Use of unsupervised word classes for entity recognition: Application to the detection of disorders in clinical reports
Maria Evangelia Chatzimina | Cyril Grouin | Pierre Zweigenbaum
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Unsupervised word classes induced from unannotated text corpora are increasingly used to help tasks addressed by supervised classification, such as standard named entity detection. This paper studies the contribution of unsupervised word classes to a medical entity detection task with two specific objectives: How do unsupervised word classes compare to available knowledge-based semantic classes? Does syntactic information help produce unsupervised word classes with better properties? We design and test two syntax-based methods to produce word classes: one applies the Brown clustering algorithm to syntactic dependencies, the other collects latent categories created by a PCFG-LA parser. When added to non-semantic features, knowledge-based semantic classes gain 7.28 points of F-measure. In the same context, basic unsupervised word classes gain 4.16pt, reaching 60% of the contribution of knowledge-based semantic classes and outperforming Wikipedia, and adding PCFG-LA unsupervised word classes gain one more point at 5.11pt, reaching 70%. Unsupervised word classes could therefore provide a useful semantic back-off in domains where no knowledge-based semantic classes are available. The combination of both knowledge-based and basic unsupervised classes gains 8.33pt. Therefore, unsupervised classes are still useful even when rich knowledge-based classes exist.

Annotation of specialized corpora using a comprehensive entity and relation scheme
Louise Deléger | Anne-Laure Ligozat | Cyril Grouin | Pierre Zweigenbaum | Aurélie Névéol
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Annotated corpora are essential resources for many applications in Natural Language Processing. They provide insight on the linguistic and semantic characteristics of the genre and domain covered, and can be used for the training and evaluation of automatic tools. In the biomedical domain, annotated corpora of English texts have become available for several genres and subfields. However, very few similar resources are available for languages other than English. In this paper we present an effort to produce a high-quality corpus of clinical documents in French, annotated with a comprehensive scheme of entities and relations. We present the annotation scheme as well as the results of a pilot annotation study covering 35 clinical documents in a variety of subfields and genres. We show that high inter-annotator agreement can be achieved using a complex annotation scheme.

Human annotation of ASR error regions: Is “gravity” a sharable concept for human annotators?
Daniel Luzzati | Cyril Grouin | Ioana Vasilescu | Martine Adda-Decker | Eric Bilinski | Nathalie Camelin | Juliette Kahn | Carole Lailler | Lori Lamel | Sophie Rosset
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper is concerned with human assessments of the severity of errors in ASR outputs. We deliberately designed no guidelines, so that each annotator involved in the study could assess the “seriousness” of an ASR error using their own scientific background. Eight human annotators were involved in an annotation task on three distinct corpora, one of which was annotated twice, without telling the annotators about the duplicate. None of the computed results (inter-annotator agreement, edit distance, majority annotation) show any strong correlation between the considered criteria and the level of seriousness, which underlines how difficult it is for a human to determine whether an ASR error is serious or not.

2013

pdf bib
Building A Contrasting Taxa Extractor for Relation Identification from Assertions: BIOlogical Taxonomy & Ontology Phrase Extraction System
Cyril Grouin
Proceedings of the BioNLP Shared Task 2013 Workshop

pdf bib
Automatic Named Entity Pre-annotation for Out-of-domain Human Annotation
Sophie Rosset | Cyril Grouin | Thomas Lavergne | Mohamed Ben Jannet | Jérémy Leixa | Olivier Galibert | Pierre Zweigenbaum
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

pdf bib
Studying frequency-based approaches to process lexical simplification (Approches à base de fréquences pour la simplification lexicale) [in French]
Anne-Laure Ligozat | Cyril Grouin | Anne Garcia-Fernandez | Delphine Bernhard
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

pdf bib
ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking
Anne-Laure Ligozat | Cyril Grouin | Anne Garcia-Fernandez | Delphine Bernhard
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Extended Named Entities Annotation on OCRed Documents: From Corpus Constitution to Evaluation Campaign
Olivier Galibert | Sophie Rosset | Cyril Grouin | Pierre Zweigenbaum | Ludovic Quintard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Within the framework of the Quaero project, we proposed a new definition of named entities, based upon an extension of both the coverage and the structure of traditional named entities. Under this new definition, the extended named entities we propose are both hierarchical and compositional. In this paper, we focus on the annotation of a corpus of press archives, OCRed from French newspapers of December 1890. We present the methodology we used to produce the corpus and its characteristics in terms of named entity annotation. This annotated corpus was then used in an evaluation campaign; we present this campaign, the metrics we used, and the results obtained by the participants.

pdf bib
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)
Cyril Grouin | Dominic Forest | Gilles Sérasset
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)

pdf bib
Indexation libre et contrôlée d’articles scientifiques. Présentation et résultats du défi fouille de textes DEFT2012 (Controlled and free indexing of scientific papers. Presentation and results of the DEFT2012 text-mining challenge) [in French]
Patrick Paroubek | Pierre Zweigenbaum | Dominic Forest | Cyril Grouin
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)

pdf bib
Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers
Sophie Rosset | Cyril Grouin | Karën Fort | Olivier Galibert | Juliette Kahn | Pierre Zweigenbaum
Proceedings of the Sixth Linguistic Annotation Workshop

pdf bib
Manual Corpus Annotation: Giving Meaning to the Evaluation Metrics
Yann Mathet | Antoine Widlöcher | Karën Fort | Claire François | Olivier Galibert | Cyril Grouin | Juliette Kahn | Sophie Rosset | Pierre Zweigenbaum
Proceedings of COLING 2012: Posters

2011

pdf bib
Proposal for an Extension of Traditional Named Entities: From Guidelines to Evaluation, an Overview
Cyril Grouin | Sophie Rosset | Pierre Zweigenbaum | Karën Fort | Olivier Galibert | Ludovic Quintard
Proceedings of the 5th Linguistic Annotation Workshop

pdf bib
Handling Outlandish Occurrences: Using Rules and Lexicons for Correcting NLP Articles
Elitza Ivanova | Delphine Bernhard | Cyril Grouin
Proceedings of the 13th European Workshop on Natural Language Generation

pdf bib
Structured and Extended Named Entity Evaluation in Automatic Speech Transcriptions
Olivier Galibert | Sophie Rosset | Cyril Grouin | Pierre Zweigenbaum | Ludovic Quintard
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
The Second Evaluation Campaign of PASSAGE on Parsing of French
Patrick Paroubek | Olivier Hamon | Eric de La Clergerie | Cyril Grouin | Anne Vilnat
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

pdf bib
A Corpus for Studying Full Answer Justification
Arnaud Grappy | Brigitte Grau | Olivier Ferret | Cyril Grouin | Véronique Moriceau | Isabelle Robba | Xavier Tannier | Anne Vilnat | Vincent Barbier
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered reliable by users, a QA system must provide elements that allow the answer to be evaluated. This notion of answer justification can also be useful when developing a QA system, as it provides criteria for selecting correct answers. An answer justification can be found in a single sentence, in a passage made of several consecutive sentences, or across several passages of one or more documents. Thus, we are interested in pinpointing both the set of information that makes it possible to verify the correctness of the answer in a candidate passage, and the question elements that are missing from that passage. Moreover, the relevant information is often expressed in texts in a form different from that of the question: anaphora, paraphrases, synonyms. In order to better gauge the importance of all the phenomena we identified, and to provide QA developers with enough examples to study them, we decided to build an annotated corpus.

2008

pdf bib
Certification and Cleaning up of a Text Corpus: Towards an Evaluation of the “Grammatical” Quality of a Corpus
Cyril Grouin
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present in this article the methods we used to obtain measures that ensure the quality and well-formedness of a text corpus. These measures allow us to determine the compatibility of a corpus with the treatments we want to apply to it. We call this method “certification of corpus”. The measures are based upon the characteristics required by the linguistic treatments to be applied to the corpus under certification. Since corpus certification highlights the errors present in a text, we developed modules to carry out automatic correction. By applying these modules, we reduced the number of errors and thereby increased the quality of the corpus, making it possible to use a corpus that a first certification would not have admitted.

pdf bib
Human Judgement as a Parameter in Evaluation Campaigns
Jean-Baptiste Berthelin | Cyril Grouin | Martine Hurault-Plantet | Patrick Paroubek
Coling 2008: Proceedings of the workshop on Human Judgements in Computational Linguistics