Benoît Sagot

Also published as: Benoit Sagot


2020

pdf bib
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
Djamé Seddah | Farah Essaidi | Amal Fethi | Matthieu Futeral | Benjamin Muller | Pedro Javier Ortiz Suárez | Benoît Sagot | Abhishek Srivastava
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching. Made of 1500 sentences, fully annotated in morpho-syntax and Universal Dependency syntax, with full translation at both the word and the sentence levels, this treebank is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing. We believe that what we present in this paper is useful beyond the low-resource language community. This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code switching, making it an challenging test-bed for most recent NLP approaches.

pdf bib
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
Pedro Javier Ortiz Suárez | Laurent Romary | Benoît Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.

pdf bib
ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations
Fernando Alva-Manchego | Louis Martin | Antoine Bordes | Carolina Scarton | Benoît Sagot | Lucia Specia
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In order to simplify a sentence, human editors perform multiple rewriting transformations: they split it into several shorter sentences, paraphrase words (i.e. replacing complex words or phrases by simpler synonyms), reorder components, and/or delete information deemed unnecessary. Despite these varied range of possible text alterations, current models for automatic sentence simplification are evaluated using datasets that are focused on a single transformation, such as lexical paraphrasing or splitting. This makes it impossible to understand the ability of simplification models in more realistic settings. To alleviate this limitation, this paper introduces ASSET, a new dataset for assessing sentence simplification in English. ASSET is a crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations. Through quantitative and qualitative experiments, we show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task. Furthermore, we motivate the need for developing better methods for automatic evaluation using ASSET, since we show that current popular metrics may not be suitable when multiple simplification transformations are performed.

pdf bib
CamemBERT: a Tasty French Language Model
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric de la Clergerie | Djamé Seddah | Benoît Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

pdf bib
Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0
Clémentine Fourrier | Benoît Sagot
Proceedings of the 12th Language Resources and Evaluation Conference

Diachronic lexical information is not only important in the field of historical linguistics, but is also increasingly used in NLP, most recently for machine translation of low resource languages. Therefore, there is a need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, for each step of the life-cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from the Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low resource machine translation or medieval languages study.

pdf bib
OFrLex: A Computational Morphological and Syntactic Lexicon for Old French
Gaël Guibon | Benoît Sagot
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper we describe our work on the development and enrichment of OFrLex, a freely available, large-coverage morphological and syntactic Old French lexicon. We rely on several heterogeneous language resources to extract structured and exploitable information. The extraction follows a semi-automatic procedure with substantial manual steps to respond to difficulties encountered while aligning lexical entries from distinct language resources. OFrLex aims at improving natural language processing tasks on Old French such as part-of-speech tagging and dependency parsing. We provide quantitative information on OFrLex and discuss its reliability. We also describe and evaluate a semi-automatic, word-embedding-based lexical enrichment process aimed at increasing the accuracy of the resource. Results of this extension technique will be manually validated in the near future, a step that will take advantage of OFrLex’s viewing, searching and editing interface, which is already accessible online.

pdf bib
Establishing a New State-of-the-Art for French Named Entity Recognition
Pedro Javier Ortiz Suárez | Yoann Dupont | Benjamin Muller | Laurent Romary | Benoît Sagot
Proceedings of the 12th Language Resources and Evaluation Conference

The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French. However, it does not include explicit information related to named entities, which are among the most useful information for several natural language processing tasks and applications. Moreover, no large-scale French corpus with named entity annotations contain referential information, which complement the type and the span of each mention with an indication of the entity it refers to. We have manually annotated the French TreeBank with such information, after an automatic pre-annotation step. We sketch the underlying annotation guidelines and we provide a few figures about the resulting annotations.

pdf bib
Controllable Sentence Simplification
Louis Martin | Éric de la Clergerie | Benoît Sagot | Antoine Bordes
Proceedings of the 12th Language Resources and Evaluation Conference

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.

pdf bib
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l’hétérogénéité des données d’entrainement (C AMEM BERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity )
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Les modèles de langue neuronaux contextuels sont désormais omniprésents en traitement automatique des langues. Jusqu’à récemment, la plupart des modèles disponibles ont été entraînés soit sur des données en anglais, soit sur la concaténation de données dans plusieurs langues. L’utilisation pratique de ces modèles — dans toutes les langues sauf l’anglais — était donc limitée. La sortie récente de plusieurs modèles monolingues fondés sur BERT (Devlin et al., 2019), notamment pour le français, a démontré l’intérêt de ces modèles en améliorant l’état de l’art pour toutes les tâches évaluées. Dans cet article, à partir d’expériences menées sur CamemBERT (Martin et al., 2019), nous montrons que l’utilisation de données à haute variabilité est préférable à des données plus uniformes. De façon plus surprenante, nous montrons que l’utilisation d’un ensemble relativement petit de données issues du web (4Go) donne des résultats aussi bons que ceux obtenus à partir d’ensembles de données plus grands de deux ordres de grandeurs (138Go).

pdf bib
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
Murielle Popa-Fabre | Pedro Javier Ortiz Suárez | Benoît Sagot | Éric de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

This paper investigates the impact of different types and size of training corpora on language models. By asking the fundamental question of quality versus quantity, we compare four French corpora by pre-training four different ELMos and evaluating them on dependency parsing, POS-tagging and Named Entities Recognition downstream tasks. We present and asses the relevance of a new balanced French corpus, CaBeRnet, that features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative corpus will allow the language models to be more efficient, and therefore yield better evaluation scores on different evaluation sets and tasks. This paper offers three main contributions: (1) two newly built corpora: (a) CaBeRnet, a French Balanced Reference Corpus and (b) CBT-fr a domain-specific corpus having both oral and written style in youth literature, (2) five versions of ELMo pre-trained on differently built corpora, and (3) a whole array of computational results on downstream tasks that deepen our understanding of the effects of corpus balance and register in NLP evaluation.

pdf bib
Comparing Statistical and Neural Models for Learning Sound Correspondences
Clémentine Fourrier | Benoît Sagot
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see if those performances generalise well. We show that both model types manage to learn sound changes despite data scarcity, although the best performing model type depends on several parameters such as the size of the training data, the ambiguity, and the prediction direction.

2019

pdf bib
Enhancing BERT for Lexical Normalization
Benjamin Muller | Benoit Sagot | Djamé Seddah
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Language model-based pre-trained representations have become ubiquitous in natural language processing. They have been shown to significantly improve the performance of neural models on a great variety of tasks. However, it remains unclear how useful those general models can be in handling non-canonical text. In this article, focusing on User Generated Content (UGC), we study the ability of BERT to perform lexical normalisation. Our contribution is simple: by framing lexical normalisation as a token prediction task, by enhancing its architecture and by carefully fine-tuning it, we show that BERT can be a competitive lexical normalisation model without the need of any UGC resources aside from 3,000 training sentences. To the best of our knowledge, it is the first work done in adapting and analysing the ability of this model to handle noisy UGC data.

pdf bib
Développement d’un lexique morphologique et syntaxique de l’ancien français (Development of a morphological and syntactic lexicon of Old French)
Benoît Sagot
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

Nous décrivons dans cet article notre travail de développement d’un lexique morphologique et syntaxique à grande échelle de l’ancien français pour le traitement automatique des langues. Nous nous sommes appuyés sur des ressources dictionnairiques et lexicales dans lesquelles l’extraction d’informations structurées et exploitables a nécessité des développements spécifiques. De plus, la mise en correspondance d’informations provenant de ces différentes sources a soulevé des difficultés. Nous donnons quelques indications quantitatives sur le lexique obtenu, et discutons de sa fiabilité dans sa version actuelle et des perspectives d’amélioration permises par l’existence d’une première version, notamment au travers de l’analyse automatique de données textuelles.

pdf bib
What Does BERT Learn about the Structure of Language?
Ganesh Jawahar | Benoît Sagot | Djamé Seddah
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

BERT is a recent language representation model that has surprisingly performed well in diverse language understanding benchmarks. This result indicates the possibility that BERT networks capture structural information about language. In this work, we provide novel support for this claim by performing a series of experiments to unpack the elements of English language structure learned by BERT. Our findings are fourfold. BERT’s phrasal representation captures the phrase-level information in the lower layers. The intermediate layers of BERT compose a rich hierarchy of linguistic information, starting with surface features at the bottom, syntactic features in the middle followed by semantic features at the top. BERT requires deeper layers while tracking subject-verb agreement to handle long-term dependency problem. Finally, the compositional scheme underlying BERT mimics classical, tree-like structures.

2018

pdf bib
A multilingual collection of CoNLL-U-compatible morphological lexicons
Benoît Sagot
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing
Amir More | Özlem Çetinoğlu | Çağrı Çöltekin | Nizar Habash | Benoît Sagot | Djamé Seddah | Dima Taji | Reut Tsarfaty
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
Djamé Seddah | Eric de la Clergerie | Benoît Sagot | Héctor Martínez Alonso | Marie Candito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
ELMoLex: Connecting ELMo and Lexicon Features for Dependency Parsing
Ganesh Jawahar | Benjamin Muller | Amal Fethi | Louis Martin | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team ‘ParisNLP’ to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an indomain version of ELMo features (Peters et al., 2018) which provide context-dependent word representations; we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complements the existing feature set. Henceforth, we call our system ‘ELMoLex’. In addition to incorporating character embeddings, ELMoLex benefits from pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words which are prevalent in languages with complex morphology. ELMoLex ranked 11th by Labeled Attachment Score metric (70.64%), Morphology-aware LAS metric (55.74%) and ranked 9th by Bilexical dependency metric (60.70%).

pdf bib
Reference-less Quality Estimation of Text Simplification Systems
Louis Martin | Samuel Humeau | Pierre-Emmanuel Mazaré | Éric de La Clergerie | Antoine Bordes | Benoît Sagot
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

2017

pdf bib
Construction automatique d’une base de données étymologiques à partir du wiktionary (Automatic construction of an etymological database using Wiktionary)
Benoît Sagot
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Les ressources lexicales électroniques ne contiennent quasiment jamais d’informations étymologiques. De telles informations, convenablement formalisées, permettraient pourtant de développer des outils automatiques au service de la linguistique historique et comparative, ainsi que d’améliorer significativement le traitement automatique de langues anciennes. Nous décrivons ici le processus que nous avons mis en œuvre pour extraire des données étymologiques à partir des notices étymologiques du wiktionary, rédigées en anglais. Nous avons ainsi produit une base multilingue de près d’un million de lexèmes et une base de plus d’un demi-million de relations étymologiques entre lexèmes.

pdf bib
The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy
Éric de La Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDpipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition or graph-based, etc.) and the best combination for each language was selected. Unfortunately, a glitch in the shared task’s Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models finally unofficially ranked 6th.

pdf bib
Annotating omission in statement pairs
Héctor Martínez Alonso | Amaury Delamaire | Benoît Sagot
Proceedings of the 11th Linguistic Annotation Workshop

We focus on the identification of omission in statement pairs. We compare three annotation schemes, namely two different crowdsourcing schemes and manual expert annotation. We show that the simplest of the two crowdsourcing approaches yields a better annotation quality than the more complex one. We use a dedicated classifier to assess whether the annotators’ behavior can be explained by straightforward linguistic features. The classifier benefits from a modeling that uses lexical information beyond length and overlap measures. However, for our task, we argue that expert and not crowdsourcing-based annotation is the best compromise between annotation cost and quality.

pdf bib
Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin
Géraldine Walther | Benoît Sagot
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

In this paper, we present ongoing work for developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are meant to improve the speed and reliability of corpus annotations for noisy data involving large amounts of code-switching, occurrences of child-speech and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling.

pdf bib
Improving neural tagging with lexical information
Benoît Sagot | Héctor Martínez Alonso
Proceedings of the 15th International Conference on Parsing Technologies

Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings. In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input. The tagger, trained on several dozen languages, shows a consistent, average improvement when using lexical information, even when also using character-based embeddings, thus showing the complementarity of the different sources of lexical information. The improvements are particularly important for the smaller datasets.

2016

pdf bib
From Noisy Questions to Minecraft Texts: Annotation Challenges in Extreme Syntax Scenario
Héctor Martínez Alonso | Djamé Seddah | Benoît Sagot
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

User-generated content presents many challenges for its automatic processing. While many of them do come from out-of-vocabulary effects, others spawn from different linguistic phenomena such as unusual syntax. In this work we present a French three-domain data set made up of question headlines from a cooking forum, game chat logs and associated forums from two popular online games (MINECRAFT & LEAGUE OF LEGENDS). We chose these domains because they encompass different degrees of lexical and syntactic compliance with canonical language. We conduct an automatic and manual evaluation of the difficulties of processing these domains for part-of-speech prediction, and introduce a pilot study to determine whether dependency analysis lends itself well to annotate these data. We also discuss the development cost of our data set.

pdf bib
Étiquetage multilingue en parties du discours avec MElt (Multilingual part-of-speech tagging with MElt)
Benoît Sagot
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Nous présentons des travaux récents réalisés autour de MElt, système discriminant d’étiquetage en parties du discours. MElt met l’accent sur l’exploitation optimale d’informations lexicales externes pour améliorer les performances des étiqueteurs par rapport aux modèles entraînés seulement sur des corpus annotés. Nous avons entraîné MElt sur plus d’une quarantaine de jeux de données couvrant plus d’une trentaine de langues. Comparé au système état-de-l’art MarMoT, MElt obtient en moyenne des résultats légèrement moins bons en l’absence de lexique externe, mais meilleurs lorsque de telles ressources sont disponibles, produisant ainsi des étiqueteurs état-de-l’art pour plusieurs langues.

2014

pdf bib
Automated Error Detection in Digitized Cultural Heritage Documents
Kata Gábor | Benoît Sagot
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

pdf bib
Analogy-based Text Normalization : the case of unknowns words (Normalisation de textes par analogie: le cas des mots inconnus) [in French]
Marion Baranes | Benoît Sagot
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Sub-categorization in ‘pour’ and lexical syntax (Sous-catégorisation en pour et syntaxe lexicale) [in French]
Benoît Sagot | Laurence Danlos | Margot Colinet
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
Named Entity Recognition and Correction in OCRized Corpora (Détection et correction automatique d’entités nommées dans des corpus OCRisés) [in French]
Benoît Sagot | Kata Gábor
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German
Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the developement of DeLex involved some manual work, we show that is represents a good tradeoff between development cost, lexical coverage and resource accuracy.

pdf bib
A Language-independent Approach to Extracting Derivational Relations from an Inflectional Lexicon
Marion Baranes | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries with the one another, and which are automatically extracted from the inflected lexicon based on surface form analogies and on part-of-speech information. It is generic enough to be applied to any language with a mainly concatenative derivational morphology. Results were obtained and evaluated on English, French, German and Spanish. Precision results are satisfying, and our French results favorably compare with another resource, although its construction relied on manually developed lexicographic information whereas our approach only requires an inflectional lexicon.

pdf bib
Developing a French FrameNet: Methodology and First results
Marie Candito | Pascal Amsili | Lucie Barque | Farah Benamara | Gaël de Chalendar | Marianne Djemaa | Pauline Haas | Richard Huyghe | Yvette Yannick Mathieu | Philippe Muller | Benoît Sagot | Laure Vieu
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Asfalda project aims to develop a French corpus with frame-based semantic annotations and automatic tools for shallow semantic analysis. We present the first part of the project: focusing on a set of notional domains, we delimited a subset of English frames, adapted them to French data when necessary, and developed the corresponding French lexicon. We believe that working domain by domain helped us to enforce the coherence of the resulting resource, and also has the advantage that, though the number of frames is limited (around a hundred), we obtain full coverage within a given domain.

pdf bib
An Open-Source Heavily Multilingual Translation Graph Extracted from Wiktionaries and Parallel Corpora
Valérie Hanoka | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes YaMTG (Yet another Multilingual Translation Graph), a new open-source heavily multilingual translation database (over 664 languages represented) built using several sources, namely various wiktionaries and the OPUS parallel corpora (Tiedemann, 2009). We detail the translation extraction process for 21 wiktionary language editions, and provide an evaluation of the translations contained in YaMTG.

pdf bib
A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages
Yves Scherrer | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.

2013

pdf bib
Can MDL Improve Unsupervised Chinese Word Segmentation?
Pierre Magistry | Benoît Sagot
Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing

pdf bib
Lexicon induction and part-of-speech tagging of non-resourced languages without any bilingual resources
Yves Scherrer | Benoît Sagot
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants

pdf bib
Enforcing Subcategorization Constraints in a Parser Using Sub-parses Recombining
Seyed Abolghasem Mirroshandel | Alexis Nasr | Benoît Sagot
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Dynamic extension of a French morphological lexicon based a text stream (Extension dynamique de lexiques morphologiques pour le français à partir d’un flux textuel) [in French]
Benoît Sagot | Damien Nouvel | Virginie Mouilleron | Marion Baranes
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

pdf bib
TCOF-POS : un corpus libre de français parlé annoté en morphosyntaxe (TCOF-POS : A Freely Available POS-Tagged Corpus of Spoken French) [in French]
Christophe Benzitoun | Karën Fort | Benoît Sagot
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Annotation référentielle du Corpus Arboré de Paris 7 en entités nommées (Referential named entity annotation of the Paris 7 French TreeBank) [in French]
Benoît Sagot | Marion Richard | Rosa Stern
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Unsupervized Word Segmentation: the Case for Mandarin Chinese
Pierre Magistry | Benoît Sagot
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Applying cross-lingual WSD to wordnet development
Marianna Apidianaki | Benoît Sagot
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The automatic development of semantic resources constitutes an important challenge in the NLP community. The methods used generally exploit existing large-scale resources, such as Princeton WordNet, often combined with information extracted from multilingual resources and parallel corpora. In this paper we show how Cross-Lingual Word Sense Disambiguation can be applied to wordnet development. We apply the proposed method to WOLF, a free wordnet for French still under construction, in order to fill synsets that did not contain any literal yet and increase its coverage.

pdf bib
Evaluating and improving syntactic lexica by plugging them within a parser
Elsa Tolone | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon, and their integration within the large-coverage TAG-based FRMG parser. The evaluations are run on two test corpora, annotated with two distinct annotation formats, namely EASy/Passage chunks and relations and CoNLL dependencies. The information provided by the evaluation results provide valuable feedback about the four lexica. Moreover, when coupled with error mining techniques, they allow us to identify how these lexica might be improved.

pdf bib
Boosting the Coverage of a Semantic Lexicon by Automatically Extracted Event Nominalizations
Kata Gábor | Marianna Apidianaki | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this article, we present a distributional analysis method for extracting nominalization relations from monolingual corpora. The acquisition method makes use of distributional and morphological information to select nominalization candidates. We explain how the learning is performed on a dependency annotated corpus and describe the nominalization results. Furthermore, we show how these results served to enrich an existing lexical resource, the WOLF (Wordnet Libre du Franc¸ais). We present the techniques that we developed in order to integrate the new information into WOLF, based on both its structure and content. Finally, we evaluate the validity of the automatically obtained information and the correctness of its integration into the semantic resource. The method proved to be useful for boosting the coverage of WOLF and presents the advantage of filling verbal synsets, which are particularly difficult to handle due to the high level of verbal polysemy.

pdf bib
Aleda, a free large-scale entity database for French
Benoît Sagot | Rosa Stern
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Named entity recognition, which focuses on the identification of the span and type of named entity mentions in texts, has drawn the attention of the NLP community for a long time. However, many real-life applications need to know which real entity each mention refers to. For such a purpose, often refered to as entity resolution and linking, an inventory of entities is required in order to constitute a reference. In this paper, we describe how we extracted such a resource for French from freely available resources (the French Wikipedia and the GeoNames database). We describe the results of an instrinsic evaluation of the resulting entity database, named Aleda, as well as those of a task-based evaluation in the context of a named entity detection system. We also compare it with the NLGbAse database (Charton and Torres-Moreno, 2010), a resource with similar objectives.

pdf bib
Cleaning noisy wordnets
Benoît Sagot | Darja Fišer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. This is why we propose an approach to detect synset outliers in order to eliminate the noise and improve accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in the synset and its surroundings with the contexts of the literals in question they are used in based on large monolingual corpora. By fine-tuning the outlier threshold we can influence how many outlier candidates will be eliminated. Although the proposed approach is language-independent we test it on Slovene and French that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and a 64% for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.

pdf bib
Wordnet extension made simple: A multilingual lexicon-based approach using wiki resources
Valérie Hanoka | Benoît Sagot
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we propose a simple methodology for building or extending wordnets using easily extractible lexical knowledge from Wiktionary and Wikipedia. This method relies on a large multilingual translation/synonym graph in many languages as well as synset-aligned wordnets. It guesses frequent and polysemous literals that are difficult to find using other methods by looking at back-translations in the graph, showing that the use of a heavily multilingual lexicon can be a way to mitigate the lack of wide coverage bilingual lexicon for wordnet creation or extension. We evaluate our approach on French by applying it for extending WOLF, a freely available French wordnet.

pdf bib
A Joint Named Entity Recognition and Entity Linking System
Rosa Stern | Benoît Sagot | Frédéric Béchet
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

pdf bib
Population of a Knowledge Base for News Metadata from Unstructured Text and Web Data
Rosa Stern | Benoît Sagot
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

pdf bib
Statistical Parsing of Spanish and Data Driven Lemmatization
Joseph Le Roux | Benoît Sagot | Djamé Seddah
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

pdf bib
Dictionary-ontology cross-enrichment
Emmanuel Eckard | Lucie Barque | Alexis Nasr | Benoît Sagot
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

pdf bib
The French Social Media Bank: a Treebank of Noisy User Generated Content
Djamé Seddah | Benoit Sagot | Marie Candito | Virginie Mouilleron | Vanessa Combet
Proceedings of COLING 2012

2010

pdf bib
Influence of Pre-Annotation on POS-Tagged Corpus Development
Karën Fort | Benoît Sagot
Proceedings of the Fourth Linguistic Annotation Workshop

pdf bib
Control Verb, Argument Cluster Coordination and Multi Component TAG
Djamé Seddah | Benoit Sagot | Laurence Danlos
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

pdf bib
Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two
Benoît Sagot | Giorgio Satta
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
A Lexicon of French Quotation Verbs for Automatic Quotation Extraction
Benoît Sagot | Laurence Danlos | Rosa Stern
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Quotation extraction is an important information extraction task, especially when dealing with news wires. Quotations can be found in various configurations. In this paper, we focus on direct quotations introduced by a parenthetical clause, headed by a ""quotation verb"". Our study is based on a large French news wire corpus from the Agence France-Presse. We introduce and motivate an analysis at the discursive level of such quotations, which differs from the syntactic analyses generally proposed. We show how we enriched the Lefff syntactic lexicon so that it provides an account for quotation verbs heading a quotation parenthetical, especially those extracted from a news wire corpus. We also sketch how these lexical entries can be extended to the discursive level in order to model quotations introduced in a parenthetical clause in a complete way.

pdf bib
A Morphological Lexicon for the Persian Language
Benoît Sagot | Géraldine Walther
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We introduce PerLex, a large-coverage and freely-available morphological lexicon for the Persian language. We describe the main features of the Persian morphology, and the way we have represented it within the Alexina formalism, on which PerLex is based. We focus on the methodology we used for constructing lexical entries from various sources, as well as the problems related to typographic normalisation. The resulting lexicon shows a satisfying coverage on a reference corpus and should therefore be a good starting point for developing a syntactic lexicon for the Persian language.

pdf bib
The Lefff, a Freely Available and Large-coverage Morphological and Syntactic Lexicon for French
Benoît Sagot
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we introduce the Lefff, a freely available, accurate and large-coverage morphological and syntactic lexicon for French, used in many NLP tools such as large-coverage parsers. We first describe Alexina, the lexical framework in which the Lefff is developed as well as the linguistic notions and formalisms it is based on. Next, we describe the various sources of lexical data we used for building the Lefff, in particular semi-automatic lexical development techniques and conversion and merging of existing resources. Finally, we illustrate the coverage and precision of the resource by comparing it with other resources and by assessing its impact in various NLP tools.

2009

pdf bib
A Morphological and Syntactic Wide-coverage Lexicon for Spanish: The Leffe
Miguel A. Molinero | Benoît Sagot | Lionel Nicolas
Proceedings of the International Conference RANLP-2009

pdf bib
Towards Efficient Production of Linguistic Resources: the Victoria Project
Lionel Nicolas | Miguel A. Molinero | Benoît Sagot | Elena Trigo | Éric de La Clergerie | Miguel Alonso Pardo | Jacques Farré | Joan Miquel Vergés
Proceedings of the International Conference RANLP-2009

pdf bib
Coupling an Annotated Corpus and a Morphosyntactic Lexicon for State-of-the-Art POS Tagging with Less Human Effort
Pascal Denis | Benoît Sagot
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf bib
Constructing parse forests that include exactly the n-best PCFG trees
Pierre Boullier | Alexis Nasr | Benoît Sagot
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Parsing Directed Acyclic Graphs with Range Concatenation Grammars
Pierre Boullier | Benoît Sagot
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf bib
Building a morphological and syntactic lexicon by merging various linguistic resources
Miguel A. Molinero | Benoît Sagot | Lionel Nicolas
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

pdf bib
MICA: A Probabilistic Dependency Parser Based on Tree Insertion Grammars (Application Note)
Srinivas Bangalore | Pierre Boullier | Alexis Nasr | Owen Rambow | Benoît Sagot
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

pdf bib
Computer Aided Correction and Extension of a Syntactic Wide-Coverage Lexicon
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric de la Clergerie
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib
Are Very Large Context-Free Grammars Tractable?
Pierre Boullier | Benoît Sagot
Proceedings of the Tenth International Conference on Parsing Technologies

2006

pdf bib
Error Mining in Parsing Results
Benoît Sagot | Éric de la Clergerie
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Deep non-probabilistic parsing of large corpora
Benoît Sagot | Pierre Boullier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper reports a large-scale non-probabilistic parsing experiment with a deep LFG parser. We briefly introduce the parser we used, named SXLFG, and the resources that were used together with it. Then we report quantitative results about the parsing of a multi-million word journalistic corpus. We show that we can parse more than 6 million words in less than 12 hours, only 6.7% of all sentences reaching the 1s timeout. This shows that deep large-coverage non-probabilistic parsers can be efficient enough to parse very large corpora in a reasonable amount of time.

pdf bib
The Lefff 2 syntactic lexicon for French: architecture, acquisition, use
Benoît Sagot | Lionel Clément | Éric Villemonte de La Clergerie | Pierre Boullier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français - Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon, whose architecture relies on properties inheritance, which makes it more compact and more easily maintainable and allows to describe lexical entries independantly from the formalisms it is used for. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French. The web site of the Lefff is http://www.lefff.net/.

pdf bib
Modeling and Analysis of Elliptic Coordination by Dynamic Exploitation of Derivation Forests in LTAG Parsing
Djamé Seddah | Benoît Sagot
Proceedings of the Eighth International Workshop on Tree Adjoining Grammar and Related Formalisms

2005

pdf bib
Efficient and Robust LFG Parsing: SxLFG
Pierre Boullier | Benoît Sagot
Proceedings of the Ninth International Workshop on Parsing Technology

2004

pdf bib
Morphology Based Automatic Acquisition of Large-coverage Lexica
Lionel Clément | Benoît Sagot | Bernard Lang
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)