Karën Fort


2020

pdf bib
Text Corpora and the Challenge of Newly Written Languages
Alice Millour | Karën Fort
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written.

pdf bib
Reviewing Natural Language Processing Research
Kevin Cohen | Karën Fort | Margot Mieskes | Aurélie Névéol
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

This tutorial will cover the theory and practice of reviewing research in natural language processing. Heavy reviewing burdens on natural language processing researchers have made it clear that our community needs to increase the size of our pool of potential reviewers. Simultaneously, notable “false negatives”---rejection by our conferences of work that was later shown to be tremendously important after acceptance by other conferences—have raised awareness of the fact that our reviewing practices leave something to be desired. We do not often talk about “false positives” with respect to conference papers, but leaders in the field have noted that we seem to have a publication bias towards papers that report high performance, with perhaps not much else of interest in them. It need not be this way. Reviewing is a learnable skill, and you will learn it here via lectures and a considerable amount of hands-on practice.

pdf bib
Creating Expert Knowledge by Relying on Language Learners: a Generic Approach for Mass-Producing Language Resources by Combining Implicit Crowdsourcing and Language Learning
Lionel Nicolas | Verena Lyding | Claudia Borg | Corina Forascu | Karën Fort | Katerina Zdravkova | Iztok Kosem | Jaka Čibej | Špela Arhar Holdt | Alice Millour | Alexander König | Christos Rodosthenous | Federico Sangati | Umair ul Hassan | Anisia Katinskaia | Anabela Barreiro | Lavinia Aparaschivei | Yaakov HaCohen-Kerner
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce in this paper a generic approach to combine implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm that consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how much these challenges have been addressed at present. Accordingly, we also report on on-going proof-of-concept efforts aiming at developing the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on the input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect) that provides the context to accelerate the implementation of this generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.

pdf bib
Rigor Mortis: Annotating MWEs with a Gamified Platform
Karën Fort | Bruno Guillaume | Yann-Alan Pilatte | Mathieu Constant | Nicolas Lefèbvre
Proceedings of the 12th Language Resources and Evaluation Conference

We present here Rigor Mortis, a gamified crowdsourcing platform designed to evaluate the intuition of the speakers, then train them to annotate multi-word expressions (MWEs) in French corpora. We previously showed that the speakers’ intuition is reasonably good (65% in recall on non-fixed MWE). We detail here the annotation results, after a training phase using some of the tests developed in the PARSEME-FR project.

pdf bib
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)
Gilles Adda | Maxime Amblard | Karën Fort
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)

pdf bib
Répliquer et étendre pour l’alsacien “Étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux” (Replicating and extending for Alsatian : “POS tagging for low-resource languages by adapting word embeddings”)
Alice Millour | Karën Fort | Pierre Magistry
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)

Nous présentons ici les résultats d’un travail de réplication et d’extension pour l’alsacien d’une expérience concernant l’étiquetage en parties du discours de langues peu dotées par spécialisation des plongements lexicaux (Magistry et al., 2018). Ce travail a été réalisé en étroite collaboration avec les auteurs de l’article d’origine. Cette interaction riche nous a permis de mettre au jour les éléments manquants dans la présentation de l’expérience, de les compléter, et d’étendre la recherche à la robustesse à la variation.

2019

pdf bib
Community Perspective on Replicability in Natural Language Processing
Margot Mieskes | Karën Fort | Aurélie Névéol | Cyril Grouin | Kevin Cohen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises how the NLP community views the topic of replicability in general. Using a survey, in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations, that successful reproducibility requires more than having access to code and data. Additionally, the results show that the topic has to be tackled from the authors’, reviewers’ and community’s side.

pdf bib
Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling
Alice Millour | Karën Fort
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Building representative linguistic resources and NLP tools for non-standardized languages is challenging: when spelling is not determined by a norm, multiple written forms can be encountered for a given word, inducing a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourced alternative spellings we use to extract rules applied to match OOV words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without expert rule definition. We apply this multilingual methodology on Alsatian, a French regional language and provide an intrinsic evaluation of the correctness of the variants pairs, and an extrinsic evaluation on a downstream task. We show that in a low-resource scenario, 145 inital pairs can lead to the generation of 876 additional variant pairs, and a diminution of OOV words improving the part-of-speech tagging performance by 1 to 4%.

2018

pdf bib
Toward a Lightweight Solution for Less-resourced Languages: Creating a POS Tagger for Alsatian Using Voluntary Crowdsourcing
Alice Millour | Karën Fort
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
“Fingers in the Nose”: Evaluating Speakers’ Identification of Multi-Word Expressions Using a Slightly Gamified Crowdsourcing Platform
Karën Fort | Bruno Guillaume | Matthieu Constant | Nicolas Lefèbvre | Yann-Alan Pilatte
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This article presents the results we obtained in crowdsourcing French speakers’ intuition concerning multi-work expressions (MWEs). We developed a slightly gamified crowdsourcing platform, part of which is designed to test users’ ability to identify MWEs with no prior training. The participants perform relatively well at the task, with a recall reaching 65% for MWEs that do not behave as function words.

2017

pdf bib
Vers une solution légère de production de données pour le TAL : création d’un tagger de l’alsacien par crowdsourcing bénévole (Toward a lightweight solution to the language resources bottleneck issue: creating a POS tagger for Alsatian using voluntary crowdsourcing)
Alice Millour | Karën Fort | Delphine Bernhard | Lucie Steiblé
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Nous présentons ici les résultats d’une expérience menée sur l’annotation en parties du discours d’un corpus d’une langue régionale encore peu dotée, l’alsacien, via une plateforme de myriadisation (crowdsourcing) bénévole développée spécifiquement à cette fin : Bisame1 . La plateforme, mise en ligne en mai 2016, nous a permis de recueillir 15 846 annotations grâce à 42 participants. L’évaluation des annotations, réalisée sur un corpus de référence, montre que la F-mesure des annotations volontaires est de 0, 93. Le tagger entraîné sur le corpus annoté atteint lui 82 % d’exactitude. Il s’agit du premier tagger spécifique à l’alsacien. Cette méthode de développement de ressources langagières est donc efficace et prometteuse pour certaines langues peu dotées, dont un nombre suffisant de locuteurs est connecté et actif sur le Web. Le code de la plateforme, le corpus annoté et le tagger sont librement disponibles.

pdf bib
Vers l’annotation par le jeu de corpus (plus) complexes : le cas de la langue de spécialité (Towards (more) complex corpora annotation using a game with a purpose : the case of scientific language)
Karën Fort | Bruno Guillaume | Nicolas Lefebvre | Laura Ramírez | Mathilde Regnault | Mary Collins | Oksana Gavrilova | Tanti Kristanti
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Nous avons précédemment montré qu’il est possible de faire produire des annotations syntaxiques de qualité par des participants à un jeu ayant un but. Nous présentons ici les résultats d’une expérience visant à évaluer leur production sur un corpus plus complexe, en langue de spécialité, en l’occurrence un corpus de textes scientifiques sur l’ADN. Nous déterminons précisément la complexité de ce corpus, puis nous évaluons les annotations en syntaxe de dépendances produites par les joueurs par rapport à une référence mise au point par des experts du domaine.

2016

pdf bib
Yes, We Care! Results of the Ethics and Natural Language Processing Surveys
Karën Fort | Alain Couillault
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present here the context and results of two surveys (a French one and an international one) concerning Ethics and NLP, which we designed and conducted between June and September 2015. These surveys follow other actions related to raising concern for ethics in our community, including a Journée d’études, a workshop and the Ethics and Big Data Charter. The concern for ethics shows to be quite similar in both surveys, despite a few differences which we present and discuss. The surveys also lead to think there is a growing awareness in the field concerning ethical issues, which translates into a willingness to get involved in ethics-related actions, to debate about the topic and to see ethics be included in major conferences themes. We finally discuss the limits of the surveys and the means of action we consider for the future. The raw data from the two surveys are freely available online.

pdf bib
Crowdsourcing Complex Language Resources: Playing to Annotate Dependency Syntax
Bruno Guillaume | Karën Fort | Nicolas Lefebvre
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This article presents the results we obtained on a complex annotation task (that of dependency syntax) using a specifically designed Game with a Purpose, ZombiLingo. We show that with suitable mechanisms (decomposition of the task, training of the players and regular control of the annotation quality during the game), it is possible to obtain annotations whose quality is significantly higher than that obtainable with a parser, provided that enough players participate. The source code of the game and the resulting annotated corpora (for French) are freely available.

2014

pdf bib
Quantitative study of disfluencies in schizophrenics’ speech: Automatize to limit biases (Étude quantitative des disfluences dans le discours de schizophrènes : automatiser pour limiter les biais) [in French]
Maxime Amblard | Karën Fort
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Annotation scheme for deep dependency syntax of French (Un schéma d’annotation en dépendances syntaxiques profondes pour le français) [in French]
Guy Perrier | Marie Candito | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
ZOMBILINGO: eating heads to perform dependency syntax annotation (ZOMBILINGO : manger des têtes pour annoter en syntaxe de dépendances) [in French]
Karën Fort | Bruno Guillaume | Valentin Stern
Proceedings of TALN 2014 (Volume 3: System Demonstrations)

pdf bib
Evaluating corpora documentation with regards to the Ethics and Big Data Charter
Alain Couillault | Karën Fort | Gilles Adda | Hugues de Mazancourt
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The authors have written the Ethic and Big Data Charter in collaboration with various agencies, private bodies and associations. This Charter aims at describing any large or complex resources, and in particular language resources, from a legal and ethical viewpoint and ensuring the transparency of the process of creating and distributing such resources. We propose in this article an analysis of the documentation coverage of the most frequently mentioned language resources with regards to the Charter, in order to show the benefit it offers.

pdf bib
Deep Syntax Annotation of the Sequoia French Treebank
Marie Candito | Guy Perrier | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah | Éric de la Clergerie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

pdf bib
Mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to a NLP lexicon using examples
Bruno Guillaume | Karën Fort | Guy Perrier | Paul Bédaride
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents experiments aiming at mapping the Lexique des Verbes du Français (Lexicon of French Verbs) to FRILEX, a Natural Language Processing (NLP) lexicon based on D ICOVALENCE. The two resources (Lexicon of French Verbs and D ICOVALENCE) were built by linguists, based on very different theories, which makes a direct mapping nearly impossible. We chose to use the examples provided in one of the resource to find implicit links between the two and make them explicit.

pdf bib
Propa-L: a semantic filtering service from a lexical network created using Games With A Purpose
Mathieu Lafourcade | Karën Fort
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article presents Propa-L, a freely accessible Web service that allows to semantically filter a lexical network. The language resources behind the service are dynamic and created through Games With A Purpose. We show an example of application of this service: the generation of a list of keywords for parental filtering on the Web, but many others can be envisaged. Moreover, the propagation algorithm we present here can be applied to any lexical network, in any language.

2013

pdf bib
Formalizing an annotation guide : some experiments towards assisted agile annotation (Expériences de formalisation d’un guide d’annotation : vers l’annotation agile assistée) [in French]
Bruno Guillaume | Karën Fort
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

pdf bib
TCOF-POS : un corpus libre de français parlé annoté en morphosyntaxe (TCOF-POS : A Freely Available POS-Tagged Corpus of Spoken French) [in French]
Christophe Benzitoun | Karën Fort | Benoît Sagot
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Annotation manuelle de matchs de foot : Oh la la la ! l’accord inter-annotateurs ! et c’est le but ! (Manual Annotation of Football Matches : Inter-annotator Agreement ! Gooooal !) [in French]
Karën Fort | Vincent Claveau
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Analyzing the Impact of Prevalence on the Evaluation of a Manual Annotation Campaign
Karën Fort | Claire François | Olivier Galibert | Maha Ghribi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This article details work aiming at evaluating the quality of the manual annotation of gene renaming couples in scientific abstracts, which generates sparse annotations. To evaluate these annotations, we compare the results obtained using the commonly advocated inter-annotator agreement coefficients such as S, κ and π, the less known R, the weighted coefficients κω and α as well as the F-measure and the SER. We analyze to which extent they are relevant for our data. We then study the bias introduced by prevalence by changing the way the contingency table is built. We finally propose an original way to synthesize the results by computing distances between categories, based on the produced annotations.

pdf bib
Annotating Football Matches: Influence of the Source Medium on Manual Annotation
Karën Fort | Vincent Claveau
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present an annotation campaign of football (soccer) matches, from a heterogeneous text corpus of both match minutes and video commentary transcripts, in French. The data, annotations and evaluation process are detailed, and the quality of the annotated corpus is discussed. In particular, we propose a new technique to better estimate the annotator agreement when few elements of a text are to be annotated. Based on that, we show how the source medium influenced the process and the quality.

pdf bib
Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers
Sophie Rosset | Cyril Grouin | Karën Fort | Olivier Galibert | Juliette Kahn | Pierre Zweigenbaum
Proceedings of the Sixth Linguistic Annotation Workshop

pdf bib
Modeling the Complexity of Manual Annotation Tasks: a Grid of Analysis
Karën Fort | Adeline Nazarenko | Sophie Rosset
Proceedings of COLING 2012

pdf bib
Manual Corpus Annotation: Giving Meaning to the Evaluation Metrics
Yann Mathet | Antoine Widlöcher | Karën Fort | Claire François | Olivier Galibert | Cyril Grouin | Juliette Kahn | Sophie Rosset | Pierre Zweigenbaum
Proceedings of COLING 2012: Posters

2011

pdf bib
Proposal for an Extension of Traditional Named Entities: From Guidelines to Evaluation, an Overview
Cyril Grouin | Sophie Rosset | Pierre Zweigenbaum | Karën Fort | Olivier Galibert | Ludovic Quintard
Proceedings of the 5th Linguistic Annotation Workshop

pdf bib
BioNLP Shared Task 2011 – Bacteria Gene Interactions and Renaming
Julien Jourde | Alain-Pierre Manine | Philippe Veber | Karën Fort | Robert Bossy | Erick Alphonse | Philippe Bessières
Proceedings of BioNLP Shared Task 2011 Workshop

pdf bib
Last Words: Amazon Mechanical Turk: Gold Mine or Coal Mine?
Karën Fort | Gilles Adda | K. Bretonnel Cohen
Computational Linguistics, Volume 37, Issue 2 - June 2011

2010

pdf bib
Influence of Pre-Annotation on POS-Tagged Corpus Development
Karën Fort | Benoît Sagot
Proceedings of the Fourth Linguistic Annotation Workshop

pdf bib
FastKwic, an “Intelligent“ Concordancer Using FASTR
Véronika Lux-Pogodalla | Dominique Besagni | Karën Fort
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we introduce the FastKwic (Key Word In Context using FASTR), a new concordancer for French and English that does not require users to learn any particular request language. Built on FASTR, it shows them not only occurrences of the searched term but also of several morphological, morpho-syntactic and syntactic variants (for example, image enhancement, enhancement of image, enhancement of fingerprint image, image texture enhancement). Fastkwic is freely available. It consists of two UTF-8 compliant Perl modules that depend on several external tools and resources : FASTR, TreeTagger, Flemm (for French). Licenses of theses tools and resources permitting, the FastKwic package is nevertheless self-sufficient. FastKwic first modules is for terminological resource compilation. Its input is a list of terms - as required by FASTR. FastKwic second module is for processing concordances. It relies on FASTR again for indexing the input corpus with terms and their variants. Its output is a concordancer: for each term and its variants, the context of occurrence is provided.

2009

pdf bib
Towards a Methodology for Named Entities Annotation
Karën Fort | Maud Ehrmann | Adeline Nazarenko
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

pdf bib
A Toolchain for Grammarians
Bruno Guillaume | Joseph Le Roux | Jonathan Marchand | Guy Perrier | Karën Fort | Jennifer Planul
Coling 2008: Companion volume: Demonstrations

2007

pdf bib
PrepLex: A Lexicon of French Prepositions for Parsing
Karën Fort | Bruno Guillaume
Proceedings of the Fourth ACL-SIGSEM Workshop on Prepositions