Bruno Cartoni


2020

Introducing Lexical Masks: a New Representation of Lexical Entries for Better Evaluation and Exchange of Lexicons
Bruno Cartoni | Daniel Calvelo Aros | Denny Vrandecic | Saran Lertpradit
Proceedings of the 12th Language Resources and Evaluation Conference

The evaluation and exchange of large lexicon databases remains a challenge in many NLP applications. Despite the existence of commonly accepted standards for the format and the features used in a lexicon, there is still a lack of precise and interoperable specification requirements about what lexical entries of a particular language should look like, both in terms of the number of forms and in terms of the features associated with these forms. This paper presents the notion of “lexical masks”, a powerful tool used to evaluate and exchange lexicon databases in many languages.

2014

Theoretical and Computational Morphology: New Trends and Synergies
Bruno Cartoni | Delphine Bernhard | Delphine Tribout
Linguistic Issues in Language Technology, Volume 11, 2014 - Theoretical and Computational Morphology: New Trends and Synergies

A Database for Measuring Linguistic Information Content
Richard Sproat | Bruno Cartoni | HyunJeong Choe | David Huynh | Linne Ha | Ravindran Rajakumar | Evelyn Wenzel-Grondie
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Which languages convey the most information in a given amount of space? This is a question often asked of linguists, especially by engineers who often have some information-theoretic measure of “information” in mind, but rarely define exactly how they would measure that information. The question is, in fact, remarkably hard to answer, and many linguists consider it unanswerable. But it is a question that seems as if it ought to have an answer. If one had a database of close translations between a set of typologically diverse languages, with detailed marking of morphosyntactic and morphosemantic features, one could hope to quantify the differences between how these different languages convey information. Since no appropriate database exists, we decided to construct one. The purpose of this paper is to present our work on the database, along with some preliminary results. We plan to release the dataset once complete.

2012

Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies
Bruno Cartoni | Thomas Meyer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Translation studies rely more and more on corpus data to examine specificities of translated texts, which can be translated from different original languages and compared to original texts. In parallel, more and more multilingual corpora are becoming available for various natural language processing tasks. This paper questions the use of these multilingual corpora in translation studies and shows the methodological steps needed in order to obtain more reliably comparable sub-corpora that consist of original and directly translated text only. Various experiments are presented that show the advantage of directional sub-corpora.

Discourse-level Annotation over Europarl for Machine Translation: Connectives and Pronouns
Andrei Popescu-Belis | Thomas Meyer | Jeevanthi Liyanapathirana | Bruno Cartoni | Sandrine Zufferey
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes methods and results for the annotation of two discourse-level phenomena, connectives and pronouns, over a multilingual parallel corpus. Excerpts from Europarl in English and French have been annotated with disambiguation information for connectives and pronouns, for about 3600 tokens. This data is then used in several ways: for cross-linguistic studies, for training automatic disambiguation software, and ultimately for training and testing discourse-aware statistical machine translation systems. The paper presents the annotation procedures and their results in detail, and overviews the first systems trained on the annotated resources and their use for machine translation.

The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction
Yves Scherrer | Bruno Cartoni
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus, called ALLEGRA, contains press releases automatically gathered from the website of the cantonal administration of Grisons. Texts have been preprocessed and aligned with a current state-of-the-art sentence aligner. The corpus is one of the first of its kind, and can be of great interest, particularly for the creation of natural language processing resources and tools for Romansh. We illustrate the use of such a trilingual resource for automatic induction of bilingual lexicons, which is a real challenge for under-represented languages. We induce a bilingual lexicon for German-Romansh by phrase alignment and evaluate the resulting entries with the help of a reference lexicon. We then show that the use of the third language of the corpus, Italian, as a pivot language can improve the precision of the induced lexicon, without loss in terms of quality of the extracted pairs.

2011

How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives
Bruno Cartoni | Sandrine Zufferey | Thomas Meyer | Andrei Popescu-Belis
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Multilingual Annotation and Disambiguation of Discourse Connectives for Machine Translation
Thomas Meyer | Andrei Popescu-Belis | Sandrine Zufferey | Bruno Cartoni
Proceedings of the SIGDIAL 2011 Conference

2010

The MuLeXFoR Database: Representing Word-Formation Processes in a Multilingual Lexicographic Environment
Bruno Cartoni | Marie-Aude Lefer
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper introduces a new lexicographic resource, the MuLeXFoR database, which aims to present word-formation processes in a multilingual environment. Morphological items represent a real challenge for lexicography, especially for the development of multilingual tools. Affixes can take part in several word-formation rules and, conversely, rules can be realised by means of a variety of affixes. Consequently, it is often difficult to provide enough information to help users understand the meaning(s) of an affix or familiarise themselves with the most frequent strategies used to translate the meaning(s) conveyed by affixes. In fact, traditional dictionaries often fail to achieve this goal. The database introduced in this paper tries to take advantage of recent advances in electronic implementation and morphological theory. Word-formation is presented as a set of multilingual rules that users can access via different indexes (affixes, rules and constructed words). MuLeXFoR entries contain, among other things, detailed descriptions of morphological constraints and productivity notes, which are sorely lacking in currently available tools such as bilingual dictionaries.

Semi-Automated Extension of a Specialized Medical Lexicon for French
Bruno Cartoni | Pierre Zweigenbaum
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the development of a specialized lexical resource for a specialized domain, namely medicine. First, in order to assess the linguistic phenomena that need to be addressed, we based our observation on a large collection of more than 300,000 terms, organised around conceptual identifiers. Based on these observations, we highlight the specificities that such a lexicon should take into account, namely in terms of inflectional and derivational knowledge. In a first experiment, we show that general resources lack a large part of the words needed to process specialized language. Secondly, we describe an experiment to semi-automatically feed a medical lexicon and populate it with inflectional information. This experiment is based on a semi-automatic method that tries to acquire inflectional knowledge from frequent endings of words recorded in existing lexicons. Thanks to this, we increased the coverage of the target vocabulary from 14.1% to 25.7%.

2009

Lexical Morphology in Machine Translation: A Feasibility Study
Bruno Cartoni
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Lexical Resources for Automatic Translation of Constructed Neologisms: the Case Study of Relational Adjectives
Bruno Cartoni
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper deals with the treatment of constructed neologisms in a machine translation system. It focuses on a particular issue in Romance languages: relational adjectives and the role they play in prefixation. Relational adjectives are formally adjectives but are semantically linked to their base-noun. In prefixation processes, the prefix is formally attached to the adjective, but its semantic value(s) is applied to the semantic features of the base-noun. This phenomenon has to be taken into account by any morphological analyser or generator. Moreover, from a contrastive perspective, the possibilities of creating adjectives out of nouns are not the same in every language. We present the special mechanism we put in place to deal with this type of prefixation, and the automatic method we used to extend lexicons so that they can retrieve the base-nouns of prefixed relational adjectives and improve the translation quality.

2006

Dealing with unknown words by simple decomposition: feasibility studies with Italian prefixes.
Bruno Cartoni
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this article, we present an experiment that aims to evaluate the feasibility of a superficial morphological analysis to analyse unknown constructed neologisms. For any morphosyntactic analyser, lexical incompleteness is a real problem. This lack of information is partly due to lexical creativity, and more especially to the productivity of some morphological processes. We present here a set of word formation rules based on constructional morphology principles that can be used to improve the performance of an Italian morphosyntactic analyser. These rules use only simple computing techniques in order to ensure efficiency, because any improvements in coverage must not slow down the entire system. In the second part of this paper, we describe a method for constraining the rules, and an evaluation of these constraints in terms of performance. Great improvements are achieved in reducing the number of incorrect analyses of unknown neologisms (“noise”), although this is at the cost of some increase in “silence” (correct analyses which are no longer produced). This classic trade-off between “noise” and “silence”, however, can hardly be avoided, and we believe that this experiment successfully demonstrates the feasibility of superficial analysis in improving performance and points the way to other avenues of research.

2004

Automatisation of the Activity of Term Collection in Different Languages
Bruno Cartoni | Pierrette Bouillon | Yalina Alphonse | Sabine Lehmann
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This article describes the use and development of a tool for grammar and terminology control (FLAG), for the purposes of automating the verification of terminology for a large-scale user of multilingual terminology. It describes the various advantages of the tool and shows a process for transforming a traditional terminology list into a list of inflected forms as well as patterns which can be used to find possible morpho-syntactic derivations of terms.

Semi-Automatic Derivation of a French Lexicon from CLIPS
Nilda Ruimy | Pierrette Bouillon | Bruno Cartoni
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)