François Yvon


2020

pdf bib
The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the 12th Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

pdf bib
SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings
Masoud Jalili Sabet | Philipp Dufter | François Yvon | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings – both static and contextualized – for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners – even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.

pdf bib
Generative latent neural models for automatic word alignment
Anh Khoa Ngo Ho | François Yvon
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

2019

pdf bib
Measuring text readability with machine comprehension: a pilot study
Marc Benzahra | François Yvon
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This article studies the relationship between text readability indice and automatic machine understanding systems. Our hypothesis is that the simpler a text is, the better it should be understood by a machine. We thus expect to a strong correlation between readability levels on the one hand, and performance of automatic reading systems on the other hand. We test this hypothesis with several understanding systems based on language models of varying strengths, measuring this correlation on two corpora of journalistic texts. Our results suggest that this correlation is rather small that existing comprehension systems are far to reproduce the gradual improvement of their performance on texts of decreasing complexity.

pdf bib
How Bad are PoS Tagger in Cross-Corpora Settings? Evaluating Annotation Divergence in the UD Project.
Guillaume Wisniewski | François Yvon
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The performance of Part-of-Speech tagging varies significantly across the treebanks of the Universal Dependencies project. This work points out that these variations may result from divergences between the annotation of train and test sets. We show how the annotation variation principle, introduced by Dickinson and Meurers (2003) to automatically detect errors in gold standard, can be used to identify inconsistencies between annotations; we also evaluate their impact on prediction performance.

2018

pdf bib
A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments
Pierre Godard | Gilles Adda | Martine Adda-Decker | Juan Benjumea | Laurent Besacier | Jamison Cooper-Leavitt | Guy-Noel Kouarata | Lori Lamel | Hélène Maynard | Markus Mueller | Annie Rialland | Sebastian Stueker | François Yvon | Marcely Zanon-Boito
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles
Guillaume Wisniewski | Ophélie Lacroix | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select among all alternatives the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.

pdf bib
Exploiting Dynamic Oracles to Train Projective Dependency Parsers on Non-Projective Trees
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Because the most common transition systems are projective, training a transition-based dependency parser often implies to either ignore or rewrite the non-projective training examples, which has an adverse impact on accuracy. In this work, we propose a simple modification of dynamic oracles, which enables the use of non-projective data when training projective parsers. Evaluation on 73 treebanks shows that our method achieves significant gains (+2 to +7 UAS for the most non-projective languages) and consistently outperforms traditional projectivization and pseudo-projectivization approaches.

pdf bib
Fixing Translation Divergences in Parallel Corpora for Neural MT
MinhQuang Pham | Josep Crego | Jean Senellart | François Yvon
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.

pdf bib
Quantifying training challenges of dependency parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 27th International Conference on Computational Linguistics

Not all dependencies are equal when training a dependency parser: some are straightforward enough to be learned with only a sample of data, others embed more complexity. This work introduces a series of metrics to quantify those differences, and thereby to expose the shortcomings of various parsing algorithms and strategies. Apart from a more thorough comparison of parsing systems, these new tools also prove useful for characterizing the information conveyed by cross-lingual parsers, in a quantitative but still interpretable way.

pdf bib
Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages
Pierre Godard | Laurent Besacier | François Yvon | Martine Adda-Decker | Gilles Adda | Hélène Maynard | Annie Rialland
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second consists in exploring the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language specific knowledge leads to great improvements for the word discovery task, ultimately achieving a leap of about 30% token F-score from the results of a strong baseline.

pdf bib
Using Monolingual Data in Neural Machine Translation: a Systematic Study
Franck Burlot | François Yvon
Proceedings of the Third Conference on Machine Translation: Research Papers

Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference with the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be the generation of artificial parallel data through back-translation - a technique that fails to fully take advantage of existing datasets. In this paper, we conduct a systematic study of back-translation, comparing alternative uses of monolingual data, as well as multiple data generation procedures. Our findings confirm that back-translation is very effective and give new explanations as to why this is the case. We also introduce new data simulation techniques that are almost as effective, yet much cheaper to implement.

pdf bib
The WMT’18 Morpheval test suites for English-Czech, English-German, English-Finnish and Turkish-English
Franck Burlot | Yves Scherrer | Vinit Ravishankar | Ondřej Bojar | Stig-Arne Grönroos | Maarit Koponen | Tommi Nieminen | François Yvon
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Progress in the quality of machine translation output calls for new automatic evaluation procedures and metrics. In this paper, we extend the Morpheval protocol introduced by Burlot and Yvon (2017) for the English-to-Czech and English-to-Latvian translation directions to three additional language pairs, and report its use to analyze the results of WMT 2018’s participants for these language pairs. Considering additional, typologically varied source and target languages also enables us to draw some generalizations regarding this morphology-oriented evaluation procedure.

pdf bib
Évaluation morphologique pour la traduction automatique : adaptation au français (Morphological Evaluation for Machine Translation : Adaptation to French)
Franck Burlot | François Yvon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Le nouvel état de l’art en traduction automatique (TA) s’appuie sur des méthodes neuronales, qui différent profondément des méthodes utilisées antérieurement. Les métriques automatiques classiques sont mal adaptées pour rendre compte de la nature du saut qualitatif observé. Cet article propose un protocole d’évaluation pour la traduction de l’anglais vers le français spécifiquement focalisé sur la compétence morphologique des systèmes de TA, en étudiant leurs performances sur différents phénomènes grammaticaux.

pdf bib
Divergences entre annotations dans le projet Universal Dependencies et leur impact sur l’évaluation des performance d’étiquetage morpho-syntaxique (Evaluating Annotation Divergences in the UD Project)
Guillaume Wisniewski | François Yvon
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

Ce travail montre que la dégradation des performances souvent observée lors de l’application d’un analyseur morpho-syntaxique à des données hors domaine résulte souvent d’incohérences entre les annotations des ensembles de test et d’apprentissage. Nous montrons comment le principe de variation des annotations, introduit par Dickinson & Meurers (2003) pour identifier automatiquement les erreurs d’annotation, peut être utilisé pour identifier ces incohérences et évaluer leur impact sur les performances des analyseurs morpho-syntaxiques.

2017

pdf bib
Normalisation automatique du vocabulaire source pour traduire depuis une langue à morphologie riche (Learning Morphological Normalization for Translation from Morphologically Rich Languages)
Franck Burlot | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 - Articles longs

Lorsqu’ils sont traduits depuis une langue à morphologie riche vers l’anglais, les mots-formes sources contiennent des marques d’informations grammaticales pouvant être jugées redondantes par rapport à l’anglais, causant une variabilité formelle qui nuit à l’estimation des modèles probabilistes. Un moyen bien documenté pour atténuer ce problème consiste à supprimer l’information non pertinente de la source en la normalisant. Ce pré-traitement est généralement effectué de manière déterministe, à l’aide de règles produites manuellement. Une telle normalisation est, par essence, sous-optimale et doit être adaptée pour chaque paire de langues. Nous présentons, dans cet article, une méthode simple pour rechercher automatiquement une normalisation optimale de la morphologie source par rapport à la langue cible et montrons que celle-ci peut améliorer la traduction automatique.

pdf bib
Adaptation au domaine pour l’analyse morpho-syntaxique (Domain Adaptation for PoS tagging)
Éléonor Bartenlian | Margot Lacour | Matthieu Labeau | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Ce travail cherche à comprendre pourquoi les performances d’un analyseur morpho-syntaxiques chutent fortement lorsque celui-ci est utilisé sur des données hors domaine. Nous montrons à l’aide d’une expérience jouet que ce comportement peut être dû à un phénomène de masquage des caractéristiques lexicalisées par les caractéristiques non lexicalisées. Nous proposons plusieurs modèles essayant de réduire cet effet.

pdf bib
LIMSI@CoNLL’17: UD Shared Task
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes LIMSI’s submission to the CoNLL 2017 UD Shared Task, which is focused on small treebanks, and how to improve low-resourced parsing only by ad hoc combination of multiple views and resources. We present our approach for low-resourced parsing, together with a detailed analysis of the results for each test treebank. We also report extensive analysis experiments on model selection for the PUD treebanks, and on annotation consistency among UD treebanks.

pdf bib
Don’t Stop Me Now! Using Global Dynamic Oracles to Correct Training Biases of Transition-Based Dependency Parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper formalizes a sound extension of dynamic oracles to global training, in the frame of transition-based dependency parsers. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements up to 0.7 UAS, without any computation overhead.

pdf bib
Word Representations in Factored Neural Machine Translation
Franck Burlot | Mercedes García-Martínez | Loïc Barrault | Fethi Bougares | François Yvon
Proceedings of the Second Conference on Machine Translation

pdf bib
Evaluating the morphological competence of Machine Translation Systems
Franck Burlot | François Yvon
Proceedings of the Second Conference on Machine Translation

pdf bib
LIMSI@WMT’17
Franck Burlot | Pooyan Safari | Matthieu Labeau | Alexandre Allauzen | François Yvon
Proceedings of the Second Conference on Machine Translation

pdf bib
The QT21 Combined Machine Translation System for English to Latvian
Jan-Thorsten Peter | Hermann Ney | Ondřej Bojar | Ngoc-Quan Pham | Jan Niehues | Alex Waibel | Franck Burlot | François Yvon | Mārcis Pinnis | Valters Šics | Jasmijn Bastings | Miguel Rios | Wilker Aziz | Philip Williams | Frédéric Blain | Lucia Specia
Proceedings of the Second Conference on Machine Translation

pdf bib
Learning the Structure of Variable-Order CRFs: a finite-state perspective
Thomas Lavergne | François Yvon
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The computational complexity of linear-chain Conditional Random Fields (CRFs) makes it difficult to deal with very large label sets and long range dependencies. Such situations are not rare and arise when dealing with morphologically rich languages or joint labelling tasks. We extend here recent proposals to consider variable order CRFs. Using an effective finite-state representation of variable-length dependencies, we propose new ways to perform feature selection at large scale and report experimental results where we outperform strong baselines on a tagging task.

2016

pdf bib
Novel elicitation and annotation schemes for sentential and sub-sentential alignments of bitexts
Yong Xu | François Yvon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources for evaluating sentence-level and word-level alignment algorithms are unsatisfactory. Regarding sentence alignments, the existing data is too scarce, especially when it comes to difficult bitexts, containing instances of non-literal translations. Regarding word-level alignments, most available hand-aligned data provide a complete annotation at the level of words that is difficult to exploit, for lack of a clear semantics for alignment links. In this study, we propose new methodologies for collecting human judgements on alignment links, which have been used to annotate 4 new data sets, at the sentence and at the word level. These will be released online, with the hope that they will prove useful to evaluate alignment software and quality estimation tools for automatic alignment. Keywords: Parallel corpora, Sentence Alignments, Word Alignments, Confidence Estimation

pdf bib
Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Because of the small size of Romanian corpora, the performance of a PoS tagger or a dependency parser trained with the standard supervised methods fall far short from the performance achieved in most languages. That is why, we apply state-of-the-art methods for cross-lingual transfer on Romanian tagging and parsing, from English and several Romance languages. We compare the performance with monolingual systems trained with sets of different sizes and establish that training on a few sentences in target language yields better results than transferring from large datasets in other languages.

pdf bib
Frustratingly Easy Cross-Lingual Transfer for Transition-Based Dependency Parsing
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
TransRead: Designing a Bilingual Reading Experience with Machine Translation Technologies
François Yvon | Yong Xu | Marianna Apidianaki | Clément Pillias | Pierre Cubaud
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
Zero-resource Dependency Parsing: Boosting Delexicalized Cross-lingual Transfer with Linguistic Knowledge
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper studies cross-lingual transfer for dependency parsing, focusing on very low-resource settings where delexicalized transfer is the only fully automatic option. We show how to boost parsing performance by rewriting the source sentences so as to better match the linguistic regularities of the target language. We contrast a data-driven approach with an approach relying on linguistically motivated rules automatically extracted from the World Atlas of Language Structures. Our findings are backed up by experiments involving 40 languages. They show that both approaches greatly outperform the baseline, the knowledge-driven method yielding the best accuracies, with average improvements of +2.9 UAS, and up to +90 UAS (absolute) on some frequent PoS configurations.

pdf bib
Parallel Sentence Compression
Julia Ive | François Yvon
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Sentence compression is a way to perform text simplification and is usually handled in a monolingual setting. In this paper, we study ways to extend sentence compression in a bilingual context, where the goal is to obtain parallel compressions of parallel sentences. This can be beneficial for a series of multilingual natural language processing (NLP) tasks. We compare two ways to take bilingual information into account when compressing parallel sentences. Their efficiency is contrasted on a parallel corpus of News articles.

pdf bib
Cross-lingual Dependency Transfer : What Matters? Assessing the Impact of Pre- and Post-processing
Ophélie Lacroix | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
Cross-lingual alignment transfer: a chicken-and-egg story?
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
LIMSI@WMT’16: Machine Translation of News
Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Ophélie Lacroix | Elena Knyazeva | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
The QT21/HimL Combined Machine Translation System
Jan-Thorsten Peter | Tamer Alkhouli | Hermann Ney | Matthias Huck | Fabienne Braune | Alexander Fraser | Aleš Tamchyna | Ondřej Bojar | Barry Haddow | Rico Sennrich | Frédéric Blain | Lucia Specia | Jan Niehues | Alex Waibel | Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Elena Knyazeva | Thomas Lavergne | François Yvon | Mārcis Pinnis | Stella Frank
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
LIMSI’s Contribution to the WMT’16 Biomedical Translation Task
Julia Ive | Aurélien Max | François Yvon
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Apprentissage d’analyseur en dépendances cross-lingue par projection partielle de dépendances (Cross-lingual learning of dependency parsers from partially projected dependencies )
Ophélie Lacroix | Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Cet article présente une méthode simple de transfert cross-lingue de dépendances. Nous montrons tout d’abord qu’il est possible d’apprendre un analyseur en dépendances par transition à partir de données partiellement annotées. Nous proposons ensuite de construire de grands ensembles de données partiellement annotés pour plusieurs langues cibles en projetant les dépendances via les liens d’alignement les plus sûrs. En apprenant des analyseurs pour les langues cibles à partir de ces données partielles, nous montrons que cette méthode simple obtient des performances qui rivalisent avec celles de méthodes état-de-l’art récentes, tout en ayant un coût algorithmique moindre.

pdf bib
Ne nous arrêtons pas en si bon chemin : améliorations de l’apprentissage global d’analyseurs en dépendances par transition (Don’t Stop Me Now ! Improved Update Strategies for Global Training of Transition-Based)
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Dans cet article, nous proposons trois améliorations simples pour l’apprentissage global d’analyseurs en dépendances par transition de type A RC E AGER : un oracle non déterministe, la reprise sur le même exemple après une mise à jour et l’entraînement en configurations sous-optimales. Leur combinaison apporte un gain moyen de 0,2 UAS sur le corpus SPMRL. Nous introduisons également un cadre général permettant la comparaison systématique de ces stratégies et de la plupart des variantes connues. Nous montrons que la littérature n’a étudié que quelques stratégies parmi les nombreuses variations possibles, négligeant ainsi plusieurs pistes d’améliorations potentielles.

pdf bib
Lecture bilingue augmentée par des alignements multi-niveaux (Augmenting bilingual reading with alignment information)
François Yvon | Yong Xu | Marianna Apidianaki | Clément Pillias | Cubaud Pierre
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 5 : Démonstrations

Le travail qui a conduit à cette démonstration combine des outils de traitement des langues multilingues, en particulier l’alignement automatique, avec des techniques de visualisation et d’interaction. Il vise à proposer des pistes pour le développement d’outils permettant de lire simultanément les différentes versions d’un texte disponible en plusieurs langues, avec des applications en lecture de loisir ou en lecture professionnelle.

2015

pdf bib
Sentence alignment for literary texts: The state-of-the-art and beyond
Yong Xu | Aurélien Max | François Yvon
Linguistic Issues in Language Technology, Volume 12, 2015 - Literature Lifts up Computational Linguistics

Literary works are becoming increasingly available in electronic formats, thus quickly transforming editorial processes and reading habits. In the context of the global enthusiasm for multilingualism, the rapid spread of e-book readers, such as Amazon Kindle R or Kobo Touch R , fosters the development of a new generation of reading tools for bilingual books. In particular, literary works, when available in several languages, offer an attractive perspective for self-development or everyday leisure reading, but also for activities such as language learning, translation or literary studies. An important issue in the automatic processing of multilingual e-books is the alignment between textual units. Alignment could help identify corresponding text units in different languages, which would be particularly beneficial to bilingual readers and translation professionals. Computing automatic alignments for literary works, however, is a task more challenging than in the case of better behaved corpora such as parliamentary proceedings or technical manuals. In this paper, we revisit the problem of computing high-quality. alignment for literary works. We first perform a large-scale evaluation of automatic alignment for literary texts, which provides a fair assessment of the actual difficulty of this task. We then introduce a two-pass approach, based on a maximum entropy model. Experimental results for novels available in English and French or in English and Spanish demonstrate the effectiveness of our method.

pdf bib
A Discriminative Training Procedure for Continuous Translation Models
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Apprentissage par imitation pour l’étiquetage de séquences : vers une formalisation des méthodes d’étiquetage easy-first
Elena Knyazeva | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

De nombreuses méthodes ont été proposées pour accélérer la prédiction d’objets structurés (tels que les arbres ou les séquences), ou pour permettre la prise en compte de dépendances plus riches afin d’améliorer les performances de la prédiction. Ces méthodes reposent généralement sur des techniques d’inférence approchée et ne bénéficient d’aucune garantie théorique aussi bien du point de vue de la qualité de la solution trouvée que du point de vue de leur critère d’apprentissage. Dans ce travail, nous étudions une nouvelle formulation de l’apprentissage structuré qui consiste à voir celui-ci comme un processus incrémental au cours duquel la sortie est construite de façon progressive. Ce cadre permet de formaliser plusieurs approches de prédiction structurée existantes. Grâce au lien que nous faisons entre apprentissage structuré et apprentissage par renforcement, nous sommes en mesure de proposer une méthode théoriquement bien justifiée pour apprendre des méthodes d’inférence approchée. Les expériences que nous réalisons sur quatre tâches de TAL valident l’approche proposée.

pdf bib
Oublier ce qu’on sait, pour mieux apprendre ce qu’on ne sait pas : une étude sur les contraintes de type dans les modèles CRF
Nicolas Pécheux | Alexandre Allauzen | Thomas Lavergne | Guillaume Wisniewski | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Quand on dispose de connaissances a priori sur les sorties possibles d’un problème d’étiquetage, il semble souhaitable d’inclure cette information lors de l’apprentissage pour simplifier la tâche de modélisation et accélérer les traitements. Pourtant, même lorsque ces contraintes sont correctes et utiles au décodage, leur utilisation lors de l’apprentissage peut dégrader sévèrement les performances. Dans cet article, nous étudions ce paradoxe et montrons que le manque de contraste induit par les connaissances entraîne une forme de sous-apprentissage qu’il est cependant possible de limiter.

pdf bib
Apprentissage discriminant des modèles continus de traduction
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Alors que les réseaux neuronaux occupent une place de plus en plus importante dans le traitement automatique des langues, les méthodes d’apprentissage actuelles utilisent pour la plupart des critères qui sont décorrélés de l’application. Cet article propose un nouveau cadre d’apprentissage discriminant pour l’estimation des modèles continus de traduction. Ce cadre s’appuie sur la définition d’un critère d’optimisation permettant de prendre en compte d’une part la métrique utilisée pour l’évaluation de la traduction et d’autre part l’intégration de ces modèles au sein des systèmes de traduction automatique. De plus, cette méthode d’apprentissage est comparée aux critères existants d’estimation que sont le maximum de vraisemblance et l’estimation contrastive bruitée. Les expériences menées sur la tâches de traduction des séminaires TED Talks de l’anglais vers le français montrent la pertinence d’un cadre discriminant d’apprentissage, dont les performances restent toutefois très dépendantes du choix d’une stratégie d’initialisation idoine. Nous montrons qu’avec une initialisation judicieuse des gains significatifs en termes de scores BLEU peuvent être obtenus.

pdf bib
The KIT-LIMSI Translation System for WMT 2015
Thanh-Le Ha | Quoc-Khanh Do | Eunah Cho | Jan Niehues | Alexandre Allauzen | François Yvon | Alex Waibel
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
LIMSI@WMT’15 : Translation Task
Benjamin Marie | Alexandre Allauzen | Franck Burlot | Quoc-Khanh Do | Julia Ive | Elena Knyazeva | Matthieu Labeau | Thomas Lavergne | Kevin Löser | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Why Predicting Post-Edition is so Hard? Failure Analysis of LIMSI Submission to the APE Shared Task
Guillaume Wisniewski | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

pdf bib
The KIT-LIMSI Translation System for WMT 2014
Quoc Khanh Do | Teresa Herrmann | Jan Niehues | Alexander Allauzen | François Yvon | Alex Waibel
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
LIMSI @ WMT’14 Medical Translation Task
Nicolas Pécheux | Li Gong | Quoc Khanh Do | Benjamin Marie | Yulia Ivanishcheva | Alexander Allauzen | Thomas Lavergne | Jan Niehues | Aurélien Max | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
LIMSI Submission for WMT’14 QE Task
Guillaume Wisniewski | Nicolas Pécheux | Alexander Allauzen | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Cross-Lingual Part-of-Speech Tagging through Ambiguous Learning
Guillaume Wisniewski | Nicolas Pécheux | Souhir Gahbiche-Braham | François Yvon
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Cross-Lingual POS Tagging through Ambiguous Learning: First Experiments (Apprentissage partiellement supervisé d’un étiqueteur morpho-syntaxique par transfert cross-lingue) [in French]
Guillaume Wisniewski | Nicolas Pécheux | Elena Knyazeva | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Comparison of scheduling methods for the learning rate of neural network language models (Modèles de langue neuronaux: une comparaison de plusieurs stratégies d’apprentissage) [in French]
Quoc-Khanh Do | Alexandre Allauzen | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Topic Adaptation for the Automatic Translation of News Articles (Adaptation thématique pour la traduction automatique de dépêches de presse) [in French]
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | François Yvon
Proceedings of TALN 2014 (Volume 1: Long Papers)

pdf bib
Towards a More Efficient Development of Statistical Machine Translation Systems (Vers un développement plus efficace des systèmes de traduction statistique : un peu de vert dans un monde de BLEU) [in French]
Li Gong | Aurélien Max | François Yvon
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
(Much) Faster Construction of SMT Phrase Tables from Large-scale Parallel Corpora (Construction (très) rapide de tables de traduction à partir de grands bi-textes) [in French]
Li Gong | Aurélien Max | François Yvon
Proceedings of TALN 2014 (Volume 3: System Demonstrations)

pdf bib
A Corpus of Machine Translation Errors Extracted from Translation Students Exercises
Guillaume Wisniewski | Natalie Kübler | François Yvon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present a freely available corpus of automatic translations accompanied with post-edited versions, annotated with labels identifying the different kinds of errors made by the MT system. These data have been extracted from translation students exercises that have been corrected by a senior professor. This corpus can be useful for training quality estimation tools and for analyzing the types of errors made MT system.

pdf bib
Rule-based Reordering Space in Statistical Machine Translation
Nicolas Pécheux | Alexander Allauzen | François Yvon
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In Statistical Machine Translation (SMT), the constraints on word reorderings have a great impact on the set of potential translations that are explored. Notwithstanding computationnal issues, the reordering space of a SMT system needs to be designed with great care: if a larger search space is likely to yield better translations, it may also lead to more decoding errors, because of the added ambiguity and the interaction with the pruning strategy. In this paper, we study this trade-off using a state-of-the art translation system, where all reorderings are represented in a word lattice prior to decoding. This allows us to directly explore and compare different reordering spaces. We study in detail a rule-based preordering system, varying the length or number of rules, the tagset used, as well as contrasting with oracle settings and purely combinatorial subsets of permutations. We focus on two language pairs: English-French, a close language pair and English-German, known to be a more challenging reordering pair.

2013

pdf bib
LIMSI @ WMT13
Alexander Allauzen | Nicolas Pécheux | Quoc Khanh Do | Marco Dinarelli | Thomas Lavergne | Aurélien Max | Hai-Son Le | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
LIMSI Submission for the WMT‘13 Quality Estimation Task: an Experiment with N-Gram Posteriors
Anil Kumar Singh | Guillaume Wisniewski | François Yvon
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
A fully discriminative training framework for Statistical Machine Translation (Un cadre d’apprentissage intégralement discriminant pour la traduction statistique) [in French]
Thomas Lavergne | Alexandre Allauzen | François Yvon
Proceedings of TALN 2013 (Volume 1: Long Papers)

pdf bib
A corpus of post-edited translations (Un corpus d’erreurs de traduction) [in French]
Guillaume Wisniewski | Anil Kumar Singh | Natalia Segal | François Yvon
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

pdf bib
Continuous Space Translation Models with Neural Networks
Hai Son Le | Alexandre Allauzen | François Yvon
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Hierarchical Sub-sentential Alignment with Anymalign
Adrien Lardilleux | François Yvon | Yves Lepage
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib
Alignement sous-phrastique hiérarchique avec Anymalign (Hierarchical Sub-Sentential Alignment with Anymalign) [in French]
Adrien Lardilleux | François Yvon | Yves Lepage
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Repérage des entités nommées pour l’arabe : adaptation non-supervisée et combinaison de systèmes (Named Entity Recognition for Arabic : Unsupervised adaptation and Systems combination) [in French]
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | Thomas Lavergne | François Yvon
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

pdf bib
Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | Thomas Lavergne | François Yvon
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Arabic is a morphologically rich language, and Arabic texts abound of complex word forms built by concatenation of multiple subparts, corresponding for instance to prepositions, articles, roots prefixes, or suffixes. The development of Arabic Natural Language Processing applications, such as Machine Translation (MT) tools, thus requires some kind of morphological analysis. In this paper, we compare various strategies for performing such preprocessing, using generic machine learning techniques. The resulting tool is compared with two open domain alternatives in the context of a statistical MT task and is shown to be faster than its competitors, with no significant difference in MT quality.

pdf bib
Aligning Bilingual Literary Works: a Pilot Study
Qian Yu | Aurélien Max | François Yvon
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

pdf bib
Measuring the Influence of Long Range Dependencies with Neural Network Language Models
Hai Son Le | Alexandre Allauzen | François Yvon
Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT

pdf bib
Non-Linear Models for Confidence Estimation
Yong Zhuang | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
LIMSI @ WMT12
Hai-Son Le | Thomas Lavergne | Alexandre Allauzen | Marianna Apidianaki | Li Gong | Aurélien Max | Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
WSD for n-best reranking and local language modeling in SMT
Marianna Apidianaki | Guillaume Wisniewski | Artem Sokolov | Aurélien Max | François Yvon
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Computing Lattice BLEU Oracle Scores for Machine Translation
Artem Sokolov | Guillaume Wisniewski | François Yvon
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

pdf bib
Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation
Souhir Gahbiche-Braham | Hélène Bonneau-Maynard | François Yvon
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf bib
LIMSI @ WMT11
Alexandre Allauzen | Hélène Bonneau-Maynard | Hai-Son Le | Aurélien Max | Guillaume Wisniewski | François Yvon | Gilles Adda | Josep Maria Crego | Adrien Lardilleux | Thomas Lavergne | Artem Sokolov
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
From n-gram-based to CRF-based Translation Models
Thomas Lavergne | Alexandre Allauzen | Josep Maria Crego | François Yvon
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Measuring the Confusability of Pronunciations in Speech Recognition
Panagiota Karanasou | François Yvon | Lori Lamel
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

pdf bib
Minimum Error Rate Training Semiring
Artem Sokolov | François Yvon
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
Discriminative Weighted Alignment Matrices For Statistical Machine Translation
Nadi Tomeh | Alexandre Allauzen | François Yvon
Proceedings of the 15th Annual conference of the European Association for Machine Translation

2010

pdf bib
Proceedings of the 14th Annual conference of the European Association for Machine Translation
François Yvon | Viggo Hansen
Proceedings of the 14th Annual conference of the European Association for Machine Translation

pdf bib
LIMSI’s Statistical Translation Systems for WMT’10
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | François Yvon
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Local lexical adaptation in Machine Translation through triangulation: SMT helping SMT
Josep Maria Crego | Aurélien Max | François Yvon
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Improving Reordering with Linguistically Informed Bilingual n-grams
Josep Maria Crego | François Yvon
Coling 2010: Posters

pdf bib
Practical Very Large Scale CRFs
Thomas Lavergne | Olivier Cappé | François Yvon
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Contrastive Lexical Evaluation of Machine Translation
Aurélien Max | Josep Maria Crego | François Yvon
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper advocates a complementary measure of translation performance that focuses on the constrastive ability of two or more systems or system versions to adequately translate source words. This is motivated by three main reasons : 1) existing automatic metrics sometimes do not show significant differences that can be revealed by fine-grained focussed human evaluation, 2) these metrics are based on direct comparisons between system hypotheses with the corresponding reference translations, thus ignoring the input words that were actually translated, and 3) as these metrics do not take input hypotheses from several systems at once, fine-grained contrastive evaluation can only be done indirectly. This proposal is illustrated on a multi-source Machine Translation scenario where multiple translations of a source text are available. Significant gains (up to +1.3 BLEU point) are achieved on these experiments, and contrastive lexical evaluation is shown to provide new information that can help to better analyse a system's performance.

pdf bib
Training Continuous Space Language Models: Some Practical Issues
Hai Son Le | Alexandre Allauzen | Guillaume Wisniewski | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Assessing Phrase-Based Translation Models with Oracle Decoding
Guillaume Wisniewski | Alexandre Allauzen | François Yvon
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib
LIMSI‘s Statistical Translation Systems for WMT‘09
Alexandre Allauzen | Josep Crego | Aurélien Max | François Yvon
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Gappy Translation Units under Left-to-Right SMT Decoding
Josep M. Crego | François Yvon
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
Improvements in Analogical Learning: Application to Translating Multi-Terms of the Medical Domain
Philippe Langlais | François Yvon | Pierre Zweigenbaum
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

pdf bib
Limsi’s Statistical Translation Systems for WMT‘08
Daniel Déchelotte | Gilles Adda | Alexandre Allauzen | Hélène Bonneau-Maynard | Olivier Galibert | Jean-Luc Gauvain | Philippe Langlais | François Yvon
Proceedings of the Third Workshop on Statistical Machine Translation

pdf bib
Using LDA to detect semantically incoherent documents
Hemant Misra | Olivier Cappé | François Yvon
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

pdf bib
Normalizing SMS: are Two Metaphors Better than One ?
Catherine Kobus | François Yvon | Géraldine Damnati
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Robust Similarity Measures for Named Entities Matching
Erwan Moreau | François Yvon | Olivier Cappé
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Scaling up Analogical Learning
Philippe Langlais | François Yvon
Coling 2008: Companion volume: Posters

2005

pdf bib
An Analogical Learner for Morphological Analysis
Nicolas Stroppa | François Yvon
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

2002

pdf bib
Using the Web as a Linguistic Resource for Learning Reformulations Automatically
Florence Duclaye | François Yvon | Olivier Collin
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

pdf bib
A French Phonetic Lexicon with Variants for Speech and Language Processing
Philippe Boula de Mareüil | Christophe d’Alessandro | François Yvon | Véronique Aubergé | Jacqueline Vaissière | Angélique Amelot
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1997

pdf bib
Paradigmatic Cascades: A Linguistically Sound Model of Pronunciation by Analogy
François Yvon
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

Search
Co-authors