Ludovic Tanguy


2020

Which Dependency Parser to Use for Distributional Semantics in a Specialized Domain?
Pauline Brunet | Olivier Ferret | Ludovic Tanguy
Proceedings of the 6th International Workshop on Computational Terminology

We present a study whose objective is to compare several dependency parsers for English applied to a specialized corpus for building distributional count-based models from syntactic dependencies. One of the particularities of this study is to focus on the concepts of the target domain, which mainly occur in documents as multi-terms and must be aligned with the outputs of the parsers. We compare a set of ten parsers in terms of syntactic triplets but also in terms of distributional neighbors extracted from the models built from these triplets, both with and without an external reference concerning the semantic relations between concepts. We show more particularly that some patterns of proximity between these parsers can be observed across our different evaluations, which could give insights for anticipating the performance of a parser for building distributional models from a given corpus.

Extrinsic Evaluation of French Dependency Parsers on a Specialized Corpus: Comparison of Distributional Thesauri
Ludovic Tanguy | Pauline Brunet | Olivier Ferret
Proceedings of the 12th Language Resources and Evaluation Conference

We present a study in which we compare 11 different French dependency parsers on a specialized corpus (consisting of research articles on NLP from the proceedings of the TALN conference). Due to the lack of a suitable gold standard, we use each of the parsers’ output to generate distributional thesauri using a frequency-based method. We compare these 11 thesauri to assess the impact of choosing a parser over another. We show that, without any reference data, we can still identify relevant subsets among the different parsers. We also show that the similarity we identify between parsers is confirmed on a restricted distributional benchmark.

Collecting Tweets to Investigate Regional Variation in Canadian English
Filip Miletic | Anne Przewozny-Desriaux | Ludovic Tanguy
Proceedings of the 12th Language Resources and Evaluation Conference

We present a 78.8-million-tweet, 1.3-billion-word corpus aimed at studying regional variation in Canadian English with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. Our data collection and filtering pipeline reflects complex design criteria, which aim to allow for both data-intensive modeling methods and user-level variationist sociolinguistic analysis. It specifically consists of identifying Twitter users from the three cities, crawling their entire timelines, filtering the collected data in terms of user location and tweet language, and automatically excluding near-duplicate content. The resulting corpus mirrors national and regional specificities of Canadian English, provides sufficient aggregate and user-level data, and maintains a reasonably balanced distribution of content across regions and users. The utility of this dataset is illustrated by two example applications: the detection of regional lexical and topical variation, and the identification of contact-induced semantic shifts using vector space models. In accordance with Twitter’s developer policy, the corpus will be publicly released in the form of tweet IDs.

Impact de la structure logique des documents sur les modèles distributionnels : expérimentations sur le corpus TALN (Impact of document structure on distributional semantics models: a case study on NLP research articles)
Ludovic Tanguy | Cécile Fabre | Yoann Bard
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

We present an experiment aimed at measuring how the logical structure of a document affects lexical representations in distributional semantic models. Using structured documents (NLP research articles), we compare models built on corpora obtained by removing certain parts of the corpus texts: section titles, abstracts, introductions and conclusions. We show that, despite differences depending on the parts and the lexicon considered, these zones, reputed to be particularly informative about an article's content, have an overall less significant impact on model construction than the rest of the text.

LITL at SMM4H: An Old-school Feature-based Classifier for Identifying Adverse Effects in Tweets
Ludovic Tanguy | Lydia-Mai Ho-Dac | Cécile Fabre | Roxane Bois | Touati Mohamed Yacine Haddad | Claire Ibarboure | Marie Joyau | François Le moal | Jade Moiilic | Laura Roudaut | Mathilde Simounet | Irena Stankovic | Mickaela Vandewaetere
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

This paper describes our participation in the SMM4H shared task 2. We designed a rule-based classifier that estimates whether a tweet mentions an adverse effect associated with a medication. Our system addresses English and French, and is based on a number of specific word lists and features. These cues were mostly obtained through an extensive corpus analysis of the provided training data. Different weighting schemes were tested (manually tuned or based on a logistic regression), the best one achieving an F1 score of 0.31 for English and 0.15 for French.

2019

Comparaison qualitative et extrinsèque d’analyseurs syntaxiques du français : confrontation de modèles distributionnels sur un corpus spécialisé (Extrinsic evaluation of French dependency parsers on a specialised corpus: comparison of distributional thesauri)
Ludovic Tanguy | Pauline Brunet | Olivier Ferret
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

We present a study comparing 11 different French dependency parsers on a specialized corpus (made up of the archives of the TALN conference articles). In the absence of a gold standard, we use the output of each of these parsers to build distributional thesauri with a frequency-based method. We compare these 11 thesauri in order to offer a first overview of the impact of choosing one parser over another.

Investigating the Stability of Concrete Nouns in Word Embeddings
Bénédicte Pierrejean | Ludovic Tanguy
Proceedings of the 13th International Conference on Computational Semantics - Short Papers

We know that word embeddings trained using neural-based methods (such as word2vec SGNS) are sensitive to stability problems and that across two models trained using the exact same set of parameters, the nearest neighbors of a word are likely to change. Not all words are equally impacted by this internal instability, and recent studies have investigated features influencing the stability of word embeddings. This stability can be seen as a clue for the reliability of the semantic representation of a word. In this work, we investigate the influence of the degree of concreteness of nouns on the stability of their semantic representation. We show that for English generic corpora, abstract words are more affected by stability problems than concrete words. We also found that the difference between the degree of concreteness of a noun and that of its nearest neighbors can partly explain the stability or instability of its neighbors.

Toward a Computational Multidimensional Lexical Similarity Measure for Modeling Word Association Tasks in Psycholinguistics
Bruno Gaume | Lydia Mai Ho-Dac | Ludovic Tanguy | Cécile Fabre | Bénédicte Pierrejean | Nabil Hathout | Jérôme Farinas | Julien Pinquier | Lola Danet | Patrice Péran | Xavier De Boissezon | Mélanie Jucla
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

This paper presents the first results of a multidisciplinary project, the “Evolex” project, gathering researchers in Psycholinguistics, Neuropsychology, Computer Science, Natural Language Processing and Linguistics. The Evolex project aims at proposing a new data-based inductive method for automatically characterising the relation between pairs of French words collected in psycholinguistics experiments on lexical access. This method takes advantage of several complementary computational measures of semantic similarity. We show that some measures are more correlated than others with the frequency of lexical associations, and that they also differ in the way they capture different semantic relations. This allows us to consider building a multidimensional lexical similarity measure to automate the classification of lexical associations.

2018

Predicting Word Embeddings Variability
Bénédicte Pierrejean | Ludovic Tanguy
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Neural word embedding models (such as those built with word2vec) are known to have stability problems: when retraining a model with the exact same hyperparameters, word neighborhoods may change. We propose a method to estimate such variation, based on the overlap of neighbors of a given word in two models trained with identical hyperparameters. We show that this inherent variation is not negligible, and that it does not affect every word in the same way. We examine the influence of several features that are intrinsic to a word, corpus or embedding model and provide a methodology that can predict the variability (and as such, reliability) of a word representation in a semantic vector space.
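
As an illustration of the neighbor-overlap idea mentioned in the abstract above, here is a minimal sketch in Python. It is not the authors' implementation: the use of cosine similarity, the formula (one minus the share of common neighbors among the k nearest), the default k = 25, and the names `nearest_neighbors` and `neighbor_variation` are all assumptions made for this example, and models are represented simply as dictionaries mapping words to vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(model, word, k):
    """Return the set of the k words closest to `word` in `model` (a dict word -> vector)."""
    scored = [
        (cosine(model[word], vec), other)
        for other, vec in model.items()
        if other != word
    ]
    # Sort by decreasing similarity; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {other for _, other in scored[:k]}


def neighbor_variation(model_a, model_b, word, k=25):
    """Assumed measure: 1 - (number of shared k-nearest neighbors) / k.

    0.0 means both models give the same k neighbors; 1.0 means none are shared.
    Assumes both models contain `word` and at least k other words.
    """
    shared = nearest_neighbors(model_a, word, k) & nearest_neighbors(model_b, word, k)
    return 1.0 - len(shared) / k


# Toy vectors standing in for two training runs (hypothetical data).
run_a = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "bus": [0.1, 0.9]}
run_b = {"cat": [1.0, 0.0], "dog": [0.8, 0.2], "car": [0.9, 0.3], "bus": [0.1, 1.0]}

print(neighbor_variation(run_a, run_b, "cat", k=2))
```

With real models, the two dictionaries would be replaced by the vectors from two training runs that share the same hyperparameters, and the score could be averaged over a vocabulary sample.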

Extending the gold standard for a lexical substitution task: is it worth it?
Ludovic Tanguy | Cécile Fabre | Laura Rivière
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Towards Qualitative Word Embeddings Evaluation: Measuring Neighbors Variation
Bénédicte Pierrejean | Ludovic Tanguy
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose a method to study the variation between word embedding models trained with different parameters. We explore the variation between models trained with only one varying parameter by observing the variation of distributional neighbors, and show how changing only one parameter can have a massive impact on a given semantic space. We show that the variation does not affect all words of the semantic space equally. Variation is influenced by parameter choices, such as setting a parameter to its minimum or maximum value, but it also depends on intrinsic corpus features such as the frequency of a word. We identify semantic classes of words that remain stable across the trained models, as well as specific words with high variation.

Etude de la reproductibilité des word embeddings : repérage des zones stables et instables dans le lexique (Reproducibility of word embeddings: identifying stable and unstable zones in the semantic space)
Bénédicte Pierrejean | Ludovic Tanguy
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Vector models of distributional semantics (or word embeddings), particularly those produced by neural methods, raise reproducibility issues and yield different representations on each run, even when their parameters are left unchanged. We present a set of experiments for measuring this instability, both globally and locally. Globally, we measured the variation rate of word neighborhoods on three different corpora, which is estimated at around 17% for the 25 nearest neighbors of a word. Locally, we identified and characterized certain zones of the semantic space that show relative stability, as well as cases of high instability.

2016

Analyse d’une tâche de substitution lexicale : quelles sont les sources de difficulté ? (Difficulty analysis for a lexical substitution task)
Ludovic Tanguy | Cécile Fabre | Camille Mercier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this article we analyze the results of the SemDis 2014 campaign, which proposed a lexical substitution task in French. For the 300 sentences of the test set, annotators proposed substitutes for a target word, thereby establishing a gold standard against which participating systems were evaluated. We seek to identify the main characteristics of the test items that can explain variations in performance for humans and systems alike, relying on inter-annotator agreement for the former and recall scores for the latter. We show that while several common characteristics are associated with both types of difficulty (rarity of the sense in which the target word is used, frequency of use of the target word), others are specific to systems (degree of polysemy of the target word, syntactic complexity).

2014

TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)
Cécile Fabre | Nabil Hathout | Lydia-Mai Ho-Dac | François Morlane-Hondère | Philippe Muller | Franck Sajous | Ludovic Tanguy | Tim Van de Cruys
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

Presentation of the SemDis 2014 workshop: distributional semantics for two tasks - lexical substitution and exploration of specialized corpora (Présentation de l’atelier SemDis 2014 : sémantique distributionnelle pour la substitution lexicale et l’exploration de corpus spécialisés) [in French]
Cécile Fabre | Nabil Hathout | Lydia-Mai Ho-Dac | François Morlane-Hondère | Philippe Muller | Franck Sajous | Ludovic Tanguy | Tim Van de Cruys
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

Tuning distributional analysis for a small specialized corpus (Ajuster l’analyse distributionnelle à un corpus spécialisé de petite taille) [in French]
Cécile Fabre | Nabil Hathout | Franck Sajous | Ludovic Tanguy
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

2013

APPLYING A BEAM SEARCH TO TRANSITION-BASED DEPENDENCY PARSING: A CASE STUDY FOR FRENCH WITH THE TALISMANE SUITE (L’apport du faisceau dans l’analyse syntaxique en dépendances par transitions : études de cas avec l’analyseur Talismane) [in French]
Assaf Urieli | Ludovic Tanguy
Proceedings of TALN 2013 (Volume 1: Long Papers)

Second order similarity for exploring multilingual textual databases (Similarité de second ordre pour l’exploration de bases textuelles multilingues) [in French]
Nikola Tulechki | Ludovic Tanguy
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

Effacement de dimensions de similarité textuelle pour l’exploration de collections de rapports d’incidents aéronautiques (Deletion of dimensions of textual similarity for the exploration of collections of accident reports in aviation) [in French]
Nikola Tulechki | Ludovic Tanguy
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

An empirical resource for discovering cognitive principles of discourse organisation: the ANNODIS corpus
Stergos Afantenos | Nicholas Asher | Farah Benamara | Myriam Bras | Cécile Fabre | Mai Ho-dac | Anne Le Draoulec | Philippe Muller | Marie-Paule Péry-Woodley | Laurent Prévot | Josette Rebeyrolles | Ludovic Tanguy | Marianne Vergez-Couret | Laure Vieu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the ANNODIS resource, a discourse-level annotated corpus for French. The corpus combines two perspectives on discourse: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level discourse structures. The corpus is composed of texts that are diversified with respect to genre, length and type of discursive organisation. The methodology followed here involves an iterative design of annotation guidelines in order to reach satisfactory inter-annotator agreement levels. This allows us to raise a few issues relevant for the comparison of such complex objects as discourse structures. The corpus also serves as a source of empirical evidence for discourse theories. We present here two initial analyses taking advantage of this new annotated corpus: one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.

2002

Webaffix: Discovering Morphological Links on the WWW
Nabil Hathout | Ludovic Tanguy
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)