Éric Villemonte de la Clergerie

Also published as: Éric de La Clergerie, Eric de La Clergerie, Eric Villemonte de la Clergerie, Eric de la Clergerie, Éric de la Clergerie, Éric Villemonte de La Clergerie


2020

pdf bib
CamemBERT: a Tasty French Language Model
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric de la Clergerie | Djamé Seddah | Benoît Sagot
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

pdf bib
Controllable Sentence Simplification
Louis Martin | Éric de la Clergerie | Benoît Sagot | Antoine Bordes
Proceedings of the 12th Language Resources and Evaluation Conference

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on attributes such as length, amount of paraphrasing, lexical complexity and syntactic complexity. We also show that carefully chosen values of these attributes allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), establishes the state of the art at 41.87 SARI on the WikiLarge test set, a +1.42 improvement over the best previously reported score.

pdf bib
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l’hétérogénéité des données d’entrainement (C AMEM BERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity )
Louis Martin | Benjamin Muller | Pedro Javier Ortiz Suárez | Yoann Dupont | Laurent Romary | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Les modèles de langue neuronaux contextuels sont désormais omniprésents en traitement automatique des langues. Jusqu’à récemment, la plupart des modèles disponibles ont été entraînés soit sur des données en anglais, soit sur la concaténation de données dans plusieurs langues. L’utilisation pratique de ces modèles — dans toutes les langues sauf l’anglais — était donc limitée. La sortie récente de plusieurs modèles monolingues fondés sur BERT (Devlin et al., 2019), notamment pour le français, a démontré l’intérêt de ces modèles en améliorant l’état de l’art pour toutes les tâches évaluées. Dans cet article, à partir d’expériences menées sur CamemBERT (Martin et al., 2019), nous montrons que l’utilisation de données à haute variabilité est préférable à des données plus uniformes. De façon plus surprenante, nous montrons que l’utilisation d’un ensemble relativement petit de données issues du web (4Go) donne des résultats aussi bons que ceux obtenus à partir d’ensembles de données plus grands de deux ordres de grandeurs (138Go).

pdf bib
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
Murielle Popa-Fabre | Pedro Javier Ortiz Suárez | Benoît Sagot | Éric de la Clergerie
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

This paper investigates the impact of different types and size of training corpora on language models. By asking the fundamental question of quality versus quantity, we compare four French corpora by pre-training four different ELMos and evaluating them on dependency parsing, POS-tagging and Named Entities Recognition downstream tasks. We present and asses the relevance of a new balanced French corpus, CaBeRnet, that features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative corpus will allow the language models to be more efficient, and therefore yield better evaluation scores on different evaluation sets and tasks. This paper offers three main contributions: (1) two newly built corpora: (a) CaBeRnet, a French Balanced Reference Corpus and (b) CBT-fr a domain-specific corpus having both oral and written style in youth literature, (2) five versions of ELMo pre-trained on differently built corpora, and (3) a whole array of computational results on downstream tasks that deepen our understanding of the effects of corpus balance and register in NLP evaluation.

2019

pdf bib
Challenges of language change and variation: towards an extended treebank of Medieval French
Mathilde Regnault | Sophie Prévost | Eric Villemonte de la Clergerie
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

pdf bib
INRIA at SemEval-2019 Task 9: Suggestion Mining Using SVM with Handcrafted Features
Ilia Markov | Eric Villemonte de la Clergerie
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the INRIA approach to the suggestion mining task at SemEval 2019. The task consists of two subtasks: suggestion mining under single-domain (Subtask A) and cross-domain (Subtask B) settings. We used the Support Vector Machines algorithm trained on handcrafted features, function words, sentiment features, digits, and verbs for Subtask A, and handcrafted features for Subtask B. Our best run archived a F1-score of 51.18% on Subtask A, and ranked in the top ten of the submissions for Subtask B with 73.30% F1-score.

2018

pdf bib
ANCOR-AS: Enriching the ANCOR Corpus with Syntactic Annotations
Loïc Grobol | Isabelle Tellier | Éric de la Clergerie | Marco Dinarelli | Frédéric Landragin
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Cheating a Parser to Death: Data-driven Cross-Treebank Annotation Transfer
Djamé Seddah | Eric de la Clergerie | Benoît Sagot | Héctor Martínez Alonso | Marie Candito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
ELMoLex: Connecting ELMo and Lexicon Features for Dependency Parsing
Ganesh Jawahar | Benjamin Muller | Amal Fethi | Louis Martin | Éric Villemonte de la Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team ‘ParisNLP’ to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an indomain version of ELMo features (Peters et al., 2018) which provide context-dependent word representations; we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complements the existing feature set. Henceforth, we call our system ‘ELMoLex’. In addition to incorporating character embeddings, ELMoLex benefits from pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words which are prevalent in languages with complex morphology. ELMoLex ranked 11th by Labeled Attachment Score metric (70.64%), Morphology-aware LAS metric (55.74%) and ranked 9th by Bilexical dependency metric (60.70%).

pdf bib
Reference-less Quality Estimation of Text Simplification Systems
Louis Martin | Samuel Humeau | Pierre-Emmanuel Mazaré | Éric de La Clergerie | Antoine Bordes | Benoît Sagot
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

2017

pdf bib
Apports des analyses syntaxiques pour la détection automatique de mentions dans un corpus de français oral (Experiences in using deep and shallow parsing to detect entity mentions in oral French)
Loïc Grobol | Isabelle Tellier | Éric de La Clergerie | Marco Dinarelli | Frédéric Landragin
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Cet article présente trois expériences de détection de mentions dans un corpus de français oral : ANCOR. Ces expériences utilisent des outils préexistants d’analyse syntaxique du français et des méthodes issues de travaux sur la coréférence, les anaphores et la détection d’entités nommées. Bien que ces outils ne soient pas optimisés pour le traitement de l’oral, la qualité de la détection des mentions que nous obtenons est comparable à l’état de l’art des systèmes conçus pour l’écrit dans d’autres langues. Nous concluons en proposant des perspectives pour l’amélioration des résultats que nous obtenons et la construction d’un système end-to-end pour lequel nos expériences peuvent servir de base de travail.

pdf bib
The ParisNLP entry at the ConLL UD Shared Task 2017: A Tale of a #ParsingTragedy
Éric de La Clergerie | Benoît Sagot | Djamé Seddah
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDpipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition or graph-based, etc.) and the best combination for each language was selected. Unfortunately, a glitch in the shared task’s Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models finally unofficially ranked 6th.

2016

pdf bib
Accurate Deep Syntactic Parsing of Graphs: The Case of French
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parsing predicate-argument structures in a deep syntax framework requires graphs to be predicted. Argument structures represent a higher level of abstraction than the syntactic ones and are thus more difficult to predict even for highly accurate parsing models on surfacic syntax. In this paper we investigate deep syntax parsing, using a French data set (Ribeyre et al., 2014a). We demonstrate that the use of topologically different types of syntactic features, such as dependencies, tree fragments, spines or syntactic paths, brings a much needed context to the parser. Our higher-order parsing model, gaining thus up to 4 points, establishes the state of the art for parsing French deep syntactic structures.

2015

pdf bib
Because Syntax Does Matter: Improving Predicate-Argument Structures Parsing with Syntactic Features
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib
Alpage: Transition-based Semantic Graph Parsing with Syntactic Features
Corentin Ribeyre | Eric Villemonte de la Clergerie | Djamé Seddah
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
Playing with parsers (Jouer avec des analyseurs syntaxiques) [in French]
Éric Villemonte de la Clergerie
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Deep Syntax Annotation of the Sequoia French Treebank
Marie Candito | Guy Perrier | Bruno Guillaume | Corentin Ribeyre | Karën Fort | Djamé Seddah | Éric de la Clergerie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We define a deep syntactic representation scheme for French, which abstracts away from surface syntactic variation and diathesis alternations, and describe the annotation of deep syntactic representations on top of the surface dependency trees of the Sequoia corpus. The resulting deep-annotated corpus, named deep-sequoia, is freely available, and hopefully useful for corpus linguistics studies and for training deep analyzers to prepare semantic analysis.

pdf bib
Towards an environment for the production and the validation of lexical semantic resources
Mikaël Morardo | Éric Villemonte de la Clergerie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present the components of a processing chain for the creation, visualization, and validation of lexical resources (formed of terms and relations between terms). The core of the chain is a component for building lexical networks relying on Harris’ distributional hypothesis applied on the syntactic dependencies produced by the French parser FRMG on large corpora. Another important aspect concerns the use of an online interface for the visualization and collaborative validation of the resulting resources.

2013

pdf bib
Exploring beam-based shift-reduce dependency parsing with DyALog: Results from the SPMRL 2013 shared task
Éric de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Overview of the SPMRL 2013 Shared Task: A Cross-Framework Evaluation of Parsing Morphologically Rich Languages
Djamé Seddah | Reut Tsarfaty | Sandra Kübler | Marie Candito | Jinho D. Choi | Richárd Farkas | Jennifer Foster | Iakes Goenaga | Koldo Gojenola Galletebeitia | Yoav Goldberg | Spence Green | Nizar Habash | Marco Kuhlmann | Wolfgang Maier | Joakim Nivre | Adam Przepiórkowski | Ryan Roth | Wolfgang Seeker | Yannick Versley | Veronika Vincze | Marcin Woliński | Alina Wróblewska | Eric Villemonte de la Clergerie
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf bib
Improving a symbolic parser through partially supervised learning
Éric de la Clergerie
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

2012

pdf bib
Evaluating and improving syntactic lexica by plugging them within a parser
Elsa Tolone | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present some evaluation results for four French syntactic lexica, obtained through their conversion to the Alexina format used by the Lefff lexicon, and their integration within the large-coverage TAG-based FRMG parser. The evaluations are run on two test corpora, annotated with two distinct annotation formats, namely EASy/Passage chunks and relations and CoNLL dependencies. The information provided by the evaluation results provide valuable feedback about the four lexica. Moreover, when coupled with error mining techniques, they allow us to identify how these lexica might be improved.

pdf bib
Boosting the Coverage of a Semantic Lexicon by Automatically Extracted Event Nominalizations
Kata Gábor | Marianna Apidianaki | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this article, we present a distributional analysis method for extracting nominalization relations from monolingual corpora. The acquisition method makes use of distributional and morphological information to select nominalization candidates. We explain how the learning is performed on a dependency annotated corpus and describe the nominalization results. Furthermore, we show how these results served to enrich an existing lexical resource, the WOLF (Wordnet Libre du Franc¸ais). We present the techniques that we developed in order to integrate the new information into WOLF, based on both its structure and content. Finally, we evaluate the validity of the automatically obtained information and the correctness of its integration into the semantic resource. The method proved to be useful for boosting the coverage of WOLF and presents the advantage of filling verbal synsets, which are particularly difficult to handle due to the high level of verbal polysemy.

pdf bib
A linguistically-motivated 2-stage Tree to Graph Transformation
Corentin Ribeyre | Djamé Seddah | Eric Villemonte de la Clergerie
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)

2010

pdf bib
Building factorized TAGs with meta-grammars
Éric Villemonte de la Clergerie
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

pdf bib
The Second Evaluation Campaign of PASSAGE on Parsing of French
Patrick Paroubek | Olivier Hamon | Eric de La Clergerie | Cyril Grouin | Anne Vilnat
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

pdf bib
PASSAGE Syntactic Representation: a Minimal Common Ground for Evaluation
Anne Vilnat | Patrick Paroubek | Eric Villemonte de la Clergerie | Gil Francopoulo | Marie-Laure Guénot
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The current PASSAGE syntactic representation is the result of 9 years of constant evolution with the aim of providing a common ground for evaluating parsers of French whatever their type and supporting theory. In this paper we present the latest developments concerning the formalism and show first through a review of basic linguistic phenomena that it is a plausible minimal common ground for representing French syntax in the context of generic black box quantitative objective evaluation. For the phenomena reviewed, which include: the notion of syntactic head, apposition, control and coordination, we explain how PASSAGE representation relates to other syntactic representation schemes for French and English, slightly extending the annotation to address English when needed. Second, we describe the XML format chosen for PASSAGE and show that it is compliant with the latest propositions in terms of linguistic annotation standard. We conclude discussing the influence that corpus-based evaluation has on the characteristics of syntactic representation when willing to assess the performance of any kind of parser.

2009

pdf bib
Towards Efficient Production of Linguistic Resources: the Victoria Project
Lionel Nicolas | Miguel A. Molinero | Benoît Sagot | Elena Trigo | Éric de La Clergerie | Miguel Alonso Pardo | Jacques Farré | Joan Miquel Vergés
Proceedings of the International Conference RANLP-2009

pdf bib
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)
Harry Bunt | Éric Villemonte de la Clergerie
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib
PASSAGE: from French Parser Evaluation to Large Sized Treebank
Éric Villemonte de la Clergerie | Olivier Hamon | Djamel Mostefa | Christelle Ayache | Patrick Paroubek | Anne Vilnat
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present the PASSAGE project which aims at building automatically a French Treebank of large size by combining the output of several parsers, using the EASY annotation scheme. We present also the results of the of the first evaluation campaign of the project and the preliminary results we have obtained with our ROVER procedure for combining parsers automatically.

pdf bib
Large Scale Production of Syntactic Annotations to Move Forward
Anne Vilnat | Gil Francopoulo | Olivier Hamon | Sylvain Loiseau | Patrick Paroubek | Eric Villemonte de la Clergerie
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation

pdf bib
Computer Aided Correction and Extension of a Syntactic Wide-Coverage Lexicon
Lionel Nicolas | Benoît Sagot | Miguel A. Molinero | Jacques Farré | Éric de la Clergerie
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2006

pdf bib
Error Mining in Parsing Results
Benoît Sagot | Éric de la Clergerie
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
The Lefff 2 syntactic lexicon for French: architecture, acquisition, use
Benoît Sagot | Lionel Clément | Éric Villemonte de La Clergerie | Pierre Boullier
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fléchies du français - Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon, whose architecture relies on properties inheritance, which makes it more compact and more easily maintainable and allows to describe lexical entries independantly from the formalisms it is used for. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French. The web site of the Lefff is http://www.lefff.net/.

2005

pdf bib
From metagrammars to factorized TAG/TIG parsers
Éric Villemonte de la Clergerie
Proceedings of the Ninth International Workshop on Parsing Technology

2004

pdf bib
Towards an International Standard on Feature Structure Representation
Kiyong Lee | Lou Burnard | Laurent Romary | Eric de la Clergerie | Thierry Declerck | Syd Bauman | Harry Bunt | Lionel Clément | Tomaž Erjavec | Azim Roussanaly | Claude Roux
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

pdf bib
International Standard for a Linguistic Annotation Framework
Nancy Ide | Laurent Romary | Eric de la Clergerie
Proceedings of the HLT-NAACL 2003 Workshop on Software Engineering and Architecture of Language Technology Systems (SEALTS)

2002

pdf bib
Parsing MCS languages with Thread Automata
Éric Villemonte de la Clergerie
Proceedings of the Sixth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+6)

pdf bib
Parsing Mildly Context-Sensitive Languages with Thread Automata
Éric Villemonte de la Clergerie
COLING 2002: The 19th International Conference on Computational Linguistics

2001

bib
Refining Tabular Parsers for TAGs
Éric Villemonte de la Clergerie
Second Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Guided Parsing of Range Concatenation Languages
François Barthélemy | Pierre Boullier | Philippe Deschamp | Éric Villemonte de la Clergerie
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

pdf bib
Tools and resources for Tree Adjoining Grammars
François Barthélemy | Pierre Bouiller | Philippe Deschamp | Linda Kaouane | Éric Villemonte de la Clergerie
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources

2000

pdf bib
New Tabular Algorithms for Parsing
Miguel A. Alonso | Jorge Graña | Manuel Vilares | Eric de la Clergerie
Proceedings of the Sixth International Workshop on Parsing Technologies

We develop a set of new tabular parsing algorithms for Linear Indexed Grammars, including bottom-up algorithms and Earley-like algorithms with and without the valid prefix property, creating a continuum in which one algorithm can in turn be derived from another. The output of these algorithms is a shared forest in the form of a context-free grammar that encodes all possible derivations for a given input string.

pdf bib
A redefinition of Embedded Push-Down Automata
Miguel A. Alonso | Éric Villemonte de la Clergerie | Manuel Vilares
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

pdf bib
Practical aspects in compiling tabular TAG parsers
Miguel A. Alonso | Djamé Seddah | Éric Villemonte de la Clergerie
Proceedings of the Fifth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+5)

1999

pdf bib
Tabular Algorithms for TAG Parsing
Miguel A. Alonso | David Cabrero | Eric de la Clergerie | Manuel Vilares
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

pdf bib
A tabular interpretation of a class of 2-Stack Automata
Eric Villemonte de la Clergerie | Miguel Alonso Pardo
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
A tabular interpretation of a class of 2-Stack Automata
Eric Villemonte de la Clergerie | Miguel Alonso Pardo
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

pdf bib
A tabular interpretation of bottom-up automata for TAG
Eric de la Clergerie | Miguel A. Alonso Pardo | David Cabrero Souto
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)