Béatrice Daille

Also published as: Beatrice Daille


2020

pdf bib
Proceedings of the 6th International Workshop on Computational Terminology
Béatrice Daille | Kyo Kageura | Ayla Rigouts Terryn
Proceedings of the 6th International Workshop on Computational Terminology

pdf bib
A study of semantic projection from single word terms to multi-word terms in the environment domain
Yizhe Wang | Beatrice Daille | Nabil Hathout
Proceedings of the 6th International Workshop on Computational Terminology

The semantic projection method is often used in terminology structuring to infer semantic relations between terms. Semantic projection relies on the assumption of semantic compositionality: the relation that links pairs of simple terms remains valid in pairs of complex terms built from these simple terms. This paper investigates whether this assumption, commonly adopted in natural language processing, is actually valid. First, we describe the process of constructing a list of semantically linked multi-word terms (MWTs) related to the environmental field through the extraction of semantic variants. Second, we present our analysis of the results of the semantic projection. We find that contexts play an essential role in defining the relations between MWTs.

pdf bib
Towards Automatic Thesaurus Construction and Enrichment.
Amir Hazem | Beatrice Daille | Lanza Claudia
Proceedings of the 6th International Workshop on Computational Terminology

Thesaurus construction with minimal human effort often relies on automatic methods to discover terms and their relations. Hence, the quality of a thesaurus heavily depends on the methodologies chosen for: (i) building its content (terminology extraction task) and (ii) designing its structure (semantic similarity task). The performance of existing methods for automatic thesaurus construction is still less accurate than that of handcrafted ones, so it is important to highlight their drawbacks so that new strategies can build more accurate thesaurus models. In this paper, we provide a systematic analysis of existing methods for both tasks and discuss their feasibility based on an Italian cybersecurity corpus. In particular, we provide a detailed analysis of how the semantic relationship network of a thesaurus can be automatically built, and investigate ways to enrich the terminological scope of a thesaurus by taking into account the information contained in external domain-oriented semantic sets.

pdf bib
TermEval 2020: TALN-LS2N System for Automatic Term Extraction
Amir Hazem | Mérieme Bouhandi | Florian Boudin | Beatrice Daille
Proceedings of the 6th International Workshop on Computational Terminology

Automatic terminology extraction is a notoriously difficult task that aims to ease the effort of manually identifying terms in domain-specific corpora by automatically providing a ranked list of candidate terms. Approaches to this task fall into four main categories: (i) rule-based approaches, (ii) feature-based approaches, (iii) context-based approaches, and (iv) hybrid approaches. For this first TermEval shared task, we explore a feature-based approach and a deep neural network multitask approach, BERT, which we fine-tune for term extraction. We show that BERT models (RoBERTa for English and CamemBERT for French) outperform other systems for both French and English.

pdf bib
Books of Hours: the First Liturgical Data Set for Text Segmentation
Amir Hazem | Beatrice Daille | Christopher Kermorvant | Dominique Stutzmann | Marie-Laurence Bonhomme | Martin Maarand | Mélodie Boillet
Proceedings of the 12th Language Resources and Evaluation Conference

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has been scarcely studied because of its manuscript nature, its length and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, their often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) their hierarchically entangled structure offers a new field of investigation for text segmentation; (iii) in digital humanities, their textual content opens opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, consisting of Latin transcriptions of 300 books of hours generated by Handwritten Text Recognition (HTR), the analogue of Optical Character Recognition (OCR) for handwritten rather than printed texts. We designed a structural scheme for the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.

pdf bib
Hierarchical Text Segmentation for Medieval Manuscripts
Amir Hazem | Beatrice Daille | Dominique Stutzmann | Christopher Kermorvant | Louis Chevalier
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we address the segmentation of books of hours, Latin devotional manuscripts of the late Middle Ages, which exhibit challenging issues: a complex hierarchically entangled structure, variable content, noisy transcriptions with no sentence markers, and strong correlations between sections, for which topical information is no longer sufficient to draw segmentation boundaries. We show that the main state-of-the-art segmentation methods are either inefficient or inapplicable to books of hours, and we propose a bottom-up greedy approach that considerably enhances the segmentation results. We stress the importance of such hierarchical segmentation of books of hours for historians, as it lets them explore the overarching differences in the underlying conceptions of the Church.

2019

pdf bib
Réutilisation de Textes dans les Manuscrits Anciens (Text Reuse in Ancient Manuscripts)
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

In this article, we address the problem of text reuse in liturgical books of the Middle Ages. More specifically, we study the textual variations of the Obsecro Te prayer, often found in books of hours. Manual observation of 772 copies of the Obsecro Te revealed more than 21,000 textual variants. In order to extract and categorise them automatically, we first propose a lexico-semantic classification at the word n-gram level, and then report the performance of several state-of-the-art approaches for automatically matching textual variants of the Obsecro Te.

pdf bib
Terminology systematization for Cybersecurity domain in Italian Language
Claudia Lanza | Béatrice Daille
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Terminologie et Intelligence Artificielle (atelier TALN-RECITAL \& IC)

This paper presents the first steps taken to improve the quality of the first draft of an Italian thesaurus for cybersecurity terminology, developed for a specific project in collaboration with the CybersecurityLab at the Informatics and Telematics Institute (IIT) of the National Research Council (CNR) in Italy. In particular, the paper focuses first on the terminological knowledge base built to retrieve the most representative candidate terms of the cybersecurity domain in Italian, giving examples of the main gold-standard repositories used to build this semantic tool. Attention then turns to the methodology and software employed to configure a system of NLP rules that yields the desired semantic results and improves the selection of the candidate terms to be inserted in the controlled vocabulary.

pdf bib
Towards Automatic Variant Analysis of Ancient Devotional Texts
Amir Hazem | Béatrice Daille | Dominique Stutzmann | Jacob Currie | Christine Jacquin
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

In this paper, we address the issue of text reuse in liturgical manuscripts of the Middle Ages. More specifically, we study variant readings of the Obsecro Te prayer, part of the devotional books of hours often used by Christians as guidance for their daily prayers. We aim at automatically extracting and categorising pairs of words and expressions that exhibit variant relations. For this purpose, we adopt a linguistic classification that characterises the variants better than edit operations do. We then study the evolution of Obsecro Te texts along temporal and geographical axes. Finally, we contrast several unsupervised state-of-the-art approaches for the automatic extraction of Obsecro Te variants. Based on the manual observation of 772 Obsecro Te copies, which show more than 21,000 variants, we show that the proposed methodology is helpful for an automatic study of variants and may serve as a basis for analysing and extracting useful information from devotional texts.

pdf bib
KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents
Ygor Gallina | Florian Boudin | Beatrice Daille
Proceedings of the 12th International Conference on Natural Language Generation

Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate state-of-the-art neural keyphrase generation models on KPTimes to gain insights into how well they perform on the news domain. The dataset is available online at https://github.com/ygorg/KPTimes.

2018

pdf bib
Word Embedding Approach for Synonym Extraction of Multi-Word Terms
Amir Hazem | Béatrice Daille
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Towards a Diagnosis of Textual Difficulties for Children with Dyslexia
Solen Quiniou | Béatrice Daille
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib
Evaluating Lexical Similarity to build Sentiment Similarity
Grégoire Jadi | Vincent Claveau | Béatrice Daille | Laura Monceaux
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this article, we propose to evaluate the lexical similarity information provided by word representations against several opinion resources using traditional Information Retrieval tools. Word representations have been used to build and extend opinion resources such as lexicons and ontologies, and their performance has been evaluated on sentiment analysis tasks. We question this method by measuring the correlation between the sentiment proximity provided by opinion resources and the semantic similarity provided by word representations, using different correlation coefficients. We also compare the neighbours found in word representations with lists of similar opinion words. Our results show that the proximity of words in state-of-the-art word representations is not very effective for building sentiment similarity.

pdf bib
TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation
Adrien Bougouin | Sabine Barreaux | Laurent Romary | Florian Boudin | Béatrice Daille
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Keyphrase extraction is the task of finding phrases that represent the important content of a document. The main aim of keyphrase extraction is to propose textual units that represent the most important topics developed in a document. The output keyphrases of automatic keyphrase extraction methods for test documents are typically evaluated by comparing them to manually assigned reference keyphrases. Each output keyphrase is considered correct if it matches one of the reference keyphrases. However, the choice of the appropriate textual unit (keyphrase) for a topic is sometimes subjective and evaluating by exact matching underestimates the performance. This paper presents a dataset of evaluation scores assigned to automatically extracted keyphrases by human evaluators. Along with the reference keyphrases, the manual evaluations can be used to validate new evaluation measures. Indeed, an evaluation measure that is highly correlated to the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods.

pdf bib
Bilingual Lexicon Extraction at the Morpheme Level Using Distributional Analysis
Amir Hazem | Béatrice Daille
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Bilingual lexicon extraction from comparable corpora is usually based on distributional methods when dealing with single-word terms (SWTs). These methods often treat SWTs as single tokens without considering their compositional property. However, many SWTs are compositional (composed of roots and affixes), and this information, if taken into account, can be very useful for matching translational pairs, especially for infrequent terms where distributional methods often fail. For instance, the English compound xenograft, which is composed of the root xeno and the lexeme graft, can be translated into French compositionally by aligning each of its elements (xeno with xéno and graft with greffe), resulting in the translation xénogreffe. In this paper, we experiment with several distributional models at the morpheme level, which we apply to perform compositional translation of a subset of French and English compounds. We show promising results using distributional analysis at the root and affix levels. We also show that the adapted approach significantly improves bilingual lexicon extraction from comparable corpora compared to the approach at the word level.

pdf bib
Ambiguity Diagnosis for Terms in Digital Humanities
Béatrice Daille | Evelyne Jacquey | Gaël Lejeune | Luis Felipe Melo | Yannick Toussaint
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Among all research devoted to terminology and word sense disambiguation, little attention has been paid to the ambiguity of term occurrences. Even if a lexical unit is indeed a term of the domain, it is not true, even in a specialised corpus, that all its occurrences are terminological: some occurrences are terminological and others are not. Thus, a global decision at the corpus level about the terminological status of all occurrences of a lexical unit would be erroneous. In this paper, we propose three original methods to characterise the ambiguity of term occurrences in the domain of social sciences for French. These methods model the context of the term occurrences differently: the first relies on text mining, the second is based on textometry, and the last focuses on text genre properties. The experimental results show the potential of the proposed approaches and provide an opportunity to discuss their hybridisation.

pdf bib
Keyphrase Annotation with Graph Co-Ranking
Adrien Bougouin | Florian Boudin | Béatrice Daille
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Keyphrase annotation is the task of identifying textual units that represent the main content of a document. It is carried out either by extracting the most important phrases from a document (keyphrase extraction) or by assigning entries from a controlled domain-specific vocabulary (keyphrase assignment). Assignment methods are generally more reliable: they provide better-formed keyphrases, as well as keyphrases that do not occur in the document. However, they are often silent, unlike extraction methods, which do not depend on manually built resources. This paper proposes a new method that performs both keyphrase extraction and keyphrase assignment in an integrated and mutually reinforcing manner. Experiments have been carried out on datasets covering different domains of the humanities and social sciences. They show statistically significant improvements over both keyphrase extraction and keyphrase assignment state-of-the-art methods.

pdf bib
Terminology Extraction with Term Variant Detection
Damien Cram | Béatrice Daille
Proceedings of ACL-2016 System Demonstrations

pdf bib
Extraction de lexiques bilingues à partir de corpus comparables spécialisés à travers une langue pivot (Bilingual lexicon extraction from specialized comparable corpora using a pivot language)
Alexis Linard | Emmanuel Morin | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Bilingual lexicon extraction from comparable corpora is traditionally carried out using two languages. Previous work on bilingual lexicon extraction from parallel corpora has shown that using more than two languages can help improve the quality of the extracted alignments. Our work shows that the same strategy can be applied to comparable corpora. We define two original methods involving pivot languages and evaluate them on four languages and two pivot languages in particular. Our experiments show that when the alignment between the source and pivot languages is of good quality, extraction of the target-language lexicon is improved.

pdf bib
Modélisation unifiée du document et de son domaine pour une indexation par termes-clés libre et contrôlée (Unified document and domain-specific model for keyphrase extraction and assignment )
Adrien Bougouin | Florian Boudin | Beatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

In this article, we address the indexing of domain-specific documents via their keyphrases. More specifically, we are interested in indexing as performed by professional indexers in digital libraries. After analysing the methodology of these professional indexers, we propose a graph-based method that combines the information present in the document with domain knowledge to perform both free and controlled (hybrid) indexing. Our method can propose keyphrases that do not necessarily occur in the document. Our experiments also show that it significantly outperforms the state-of-the-art graph-based approach.

pdf bib
Extraction d’expressions-cibles de l’opinion : de l’anglais au français (Opinion Target Expression extraction : from English to French)
Grégoire Jadi | Laura Monceaux | Vincent Claveau | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

In this article, we present the development of an opinion target expression extraction system for English and its transposition to French. In addition, we study the effectiveness of features in English and French, which suggests that it is possible to build a domain-independent opinion target expression extraction system. Finally, we propose a comparative analysis of the errors made by our English and French systems and consider various solutions to these problems.

pdf bib
Segmentation automatique d’un texte en rhèses (Automatic segmentation of a text into rhesis)
Victor Pineau | Constance Nin | Solen Quiniou | Béatrice Daille
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Segmenting a text into rheses, the meaningful member units of a sentence, makes it possible to provide adaptations of the text that ease reading for people with dyslexia. In this article, we propose a method for automatically identifying rheses based on supervised learning from a corpus that we annotated. We compare it to manual identification, as well as to the use of related tools and concepts, such as segmenting a text into chunks.

2015

pdf bib
Extraction de Contextes Riches en Connaissances en corpus spécialisés
Firas Hmida | Emmanuel Morin | Béatrice Daille
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Terminology banks and dictionaries are valuable resources that facilitate access to knowledge in specialised domains. These resources are often rather sparse and do not always provide, for a given term, examples that help grasp its meaning and usage. In this context, we propose to apply the notion of Knowledge-Rich Contexts (KRC) to extract, directly from specialised corpora, examples of contexts illustrating a term's usage. We define a unified framework for exploiting both knowledge patterns and collocations, with a quality acceptable for human revision.

pdf bib
Vers un diagnostic d’ambiguïté des termes candidats d’un texte
Gaël Lejeune | Béatrice Daille
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Research on word sense disambiguation addresses the question of which sense to assign to different occurrences of a word or, more broadly, of a lexical unit. In this article, we are interested in the ambiguity of a term in a specialised domain. We lay the first groundwork for our research on a related question that we call ambiguity diagnosis. This task consists of deciding whether an occurrence of a term is or is not ambiguous. We implement a supervised learning approach that exploits a corpus of humanities articles written in French in which ambiguous terms were detected by experts. The diagnosis relies on two types of features: syntactic and positional. We show the value of text structure for establishing the ambiguity diagnosis.

pdf bib
Attempting to Bypass Alignment from Comparable Corpora via Pivot Language
Alexis Linard | Béatrice Daille | Emmanuel Morin
Proceedings of the Eighth Workshop on Building and Using Comparable Corpora

2014

pdf bib
Splitting of Compound Terms in non-Prototypical Compounding Languages
Elizaveta Clouet | Béatrice Daille
Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014)

pdf bib
The impact of domains for Keyphrase extraction (Influence des domaines de spécialité dans l’extraction de termes-clés) [in French]
Adrien Bougouin | Florian Boudin | Béatrice Daille
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Semi-compositional Method for Synonym Extraction of Multi-Word Terms
Béatrice Daille | Amir Hazem
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Automatic extraction of synonyms and semantically related words is a challenging task, useful in many NLP applications such as question answering, search query expansion, text summarization, etc. While various studies have addressed the task of word synonym extraction, only a few have tackled the problem of acquiring synonyms of multi-word terms (MWTs) from specialized corpora. To extract pairs of synonymous multi-word terms, we propose in this paper an unsupervised semi-compositional method that makes use of distributional semantics and exploits the compositional property shared by most MWTs. We show that our method significantly outperforms the state of the art.

2013

pdf bib
Ranking Translation Candidates Acquired from Comparable Corpora
Rima Harastani | Béatrice Daille | Emmanuel Morin
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction
Adrien Bougouin | Florian Boudin | Béatrice Daille
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Identification, Alignment, and Translation of Relational Adjectives from Comparable Corpora (Identification, alignement, et traductions des adjectifs relationnels en corpus comparables) [in French]
Rima Harastani | Beatrice Daille | Emmanuel Morin
Proceedings of TALN 2013 (Volume 1: Long Papers)

pdf bib
Multilingual Compound Splitting (Segmentation Multilingue des Mots Composés) [in French]
Elizaveta Loginova-Clouet | Béatrice Daille
Proceedings of TALN 2013 (Volume 2: Short Papers)

pdf bib
Apopsis Demonstrator for Tweet Analysis (Démonstrateur Apopsis pour l’analyse des tweets) [in French]
Sebastián Peña Saldarriaga | Damien Vintache | Béatrice Daille
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

pdf bib
TTC TermSuite - Terminological Alignment from Comparable Corpora (TTC TermSuite alignement terminologique à partir de corpus comparables) [in French]
Béatrice Daille | Rima Harastani
Proceedings of TALN 2013 (Volume 3: System Demonstrations)

2012

pdf bib
Compositionnalité et contextes issus de corpus comparables pour la traduction terminologique (Compositionality and Context for Bilingual Lexicon Extraction from Comparable Corpora) [in French]
Emmanuel Morin | Béatrice Daille
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Extraction of Domain-Specific Bilingual Lexicon from Comparable Corpora: Compositional Translation and Ranking
Estelle Delpech | Béatrice Daille | Emmanuel Morin | Claire Lemaire
Proceedings of COLING 2012

pdf bib
Revising the Compositional Method for Terminology Acquisition from Comparable Corpora
Emmanuel Morin | Béatrice Daille
Proceedings of COLING 2012

2011

pdf bib
Reduction of Search Space to Annotate Monolingual Corpora
Prajol Shrestha | Christine Jacquin | Beatrice Daille
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
TTC TermSuite - A UIMA Application for Multilingual Terminology Extraction from Comparable Corpora
Jérôme Rocheteau | Béatrice Daille
Proceedings of the IJCNLP 2011 System Demonstrations

2010

pdf bib
UNPMC: Naive Approach to Extract Keyphrases from Scientific Articles
Jungyeul Park | Jong Gun Lee | Béatrice Daille
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Learning Subjectivity Phrases missing from Resources through a Large Set of Semantic Tests
Matthieu Vernier | Laura Monceaux | Béatrice Daille
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In recent years, blogs and social networks have particularly boosted interest in opinion mining research. In order to satisfy real-scale application needs, a main task is to create or enhance lexical and semantic resources on evaluative language. Classical resources in the area are mostly built for English; they contain simple opinion word markers and fall far short of covering the lexical richness of this linguistic phenomenon. In particular, infrequent subjective words, idiomatic expressions and cultural stereotypes are missing from resources. We propose a new method, applied to French, to automatically enhance an opinion word lexicon. This learning method relies on the linguistic usage of internet users and on semantic tests to infer the degree of subjectivity of many new adjectives, nouns, verbs, noun phrases and verbal phrases that are usually absent from other resources. The final appraisal lexicon contains 3,456 entries. We evaluate the lexicon enhancement with and without textual context.

2009

pdf bib
Compilation of Specialized Comparable Corpora in French and Japanese
Lorraine Goeuriot | Emmanuel Morin | Béatrice Daille
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2008

pdf bib
A Multi-Word Term Extraction Program for Arabic Language
Siham Boulaknadel | Beatrice Daille | Driss Aboutajdine
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Terminology extraction commonly includes two steps: identification of term-like units in texts, mostly multi-word phrases, and ranking of the extracted term-like units according to their domain representativity. In this paper, we design a multi-word term extraction program for the Arabic language. The linguistic filtering performs a morphosyntactic analysis and takes into account several types of variation. Domain representativity is measured using statistical scores. We evaluate several association measures and show that the results we obtained are consistent with those obtained for Romance languages.

pdf bib
Characterization of Scientific and Popular Science Discourse in French, Japanese and Russian
Lorraine Goeuriot | Natalia Grabar | Béatrice Daille
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We aim to characterise the comparability of corpora, and we address this issue in a trilingual context through the distinction between expert and non-expert documents. We work separately with corpora composed of documents from the medical domain in three languages (French, Japanese and Russian) that exhibit considerable linguistic distance from one another. In our approach, documents are characterised in each language by their topic and by a discursive typology positioned at three levels of document analysis: structural, modal and lexical. The document typology is implemented with two learning algorithms (SVMlight and C4.5). Evaluation of the results shows that the proposed discursive typology can be transposed from one language to another, as it indeed allows the two targeted discourses (science and popular science) to be distinguished. However, we observe that performance varies considerably according to languages, algorithms and types of discursive characteristics.

pdf bib
An Effective Compositional Model for Lexical Alignment
Béatrice Daille | Emmanuel Morin
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

2007

pdf bib
Bilingual Terminology Mining - Using Brain, not brawn comparable corpora
Emmanuel Morin | Béatrice Daille | Koichi Takeuchi | Kyo Kageura
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2005

pdf bib
French-English Terminology Extraction from Comparable Corpora
Béatrice Daille | Emmanuel Morin
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf bib
French-English Multi-word Term Alignment Based on Lexical Context Analysis
Béatrice Daille | Samuel Dufour-Kowalski | Emmanuel Morin
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Construction of Grammar Based Term Extraction Model for Japanese
Koichi Takeuchi | Kyo Kageura | Béatrice Daille | Laurent Romary
Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology

2003

pdf bib
Conceptual Structuring through Term Variations
Béatrice Daille
Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment

2002

pdf bib
Extracting French-Japanese Word Pairs from Bilingual Corpora based on Transliteration Rules
Keita Tsuji | Beatrice Daille | Kyo Kageura
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Incremental Recognition and Referential Categorization of French Proper Names
Nordine Fourour | Emmanuel Morin | Béatrice Daille
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

pdf bib
Morphological Rule Induction for Terminology Acquisition
Beatrice Daille
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

1994

pdf bib
Towards Automatic Extraction of Monolingual and Bilingual Terminology
Beatrice Daille | Eric Gaussier | Jean-Marc Lange
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

pdf bib
Study and Implementation of Combined Techniques for Automatic Extraction of Terminology
Beatrice Daille
The Balancing Act: Combining Symbolic and Statistical Approaches to Language