Benjamin Marie


2020

pdf bib
Tagged Back-translation Revisited: Why Does It Really Work?
Benjamin Marie | Raphael Rubino | Atsushi Fujita
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we show that neural machine translation (NMT) systems trained on large back-translated data overfit some of the characteristics of machine-translated texts. Such NMT systems better translate human-produced translations, i.e., translationese, but may largely worsen the translation quality of original texts. Our analysis reveals that adding a simple tag to back-translations prevents this quality degradation and improves on average the overall translation quality by helping the NMT system to distinguish back-translated data from original parallel data during training. We also show that, in contrast to high-resource configurations, NMT systems trained in low-resource settings are much less vulnerable to overfit back-translations. We conclude that the back-translations in the training data should always be tagged especially when the origin of the text to be translated is unknown.

pdf bib
Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation
Benjamin Marie | Atsushi Fujita
Transactions of the Association for Computational Linguistics, Volume 8

Neural machine translation (NMT) systems are usually trained on clean parallel data. They can perform very well for translating clean in-domain texts. However, as demonstrated by previous work, the translation quality significantly worsens when translating noisy texts, such as user-generated texts (UGT) from online social media. Given the lack of parallel data of UGT that can be used to train or adapt NMT systems, we synthesize parallel data of UGT, exploiting monolingual data of UGT through crosslingual language model pre-training and zero-shot NMT systems. This paper presents two different but complementary approaches: One alters given clean parallel data into UGT-like parallel data whereas the other generates translations from monolingual data of UGT. On the MTNT translation tasks, we show that our synthesized parallel data can lead to better NMT systems for UGT while making them more robust in translating texts from various domains and styles.

2019

pdf bib
Supervised and Unsupervised Machine Translation for Myanmar-English and Khmer-English
Benjamin Marie | Hour Kaing | Aye Myat Mon | Chenchen Ding | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the 6th Workshop on Asian Translation

This paper presents the NICT’s supervised and unsupervised machine translation systems for the WAT2019 Myanmar-English and Khmer-English translation tasks. For all the translation directions, we built state-of-the-art supervised neural (NMT) and statistical (SMT) machine translation systems, using monolingual data cleaned and normalized. Our combination of NMT and SMT performed among the best systems for the four translation directions. We also investigated the feasibility of unsupervised machine translation for low-resource and distant language pairs and confirmed observations of previous work showing that unsupervised MT is still largely unable to deal with them.

pdf bib
Unsupervised Joint Training of Bilingual Word Embeddings
Benjamin Marie | Atsushi Fujita
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

State-of-the-art methods for unsupervised bilingual word embeddings (BWE) train a mapping function that maps pre-trained monolingual word embeddings into a bilingual space. Despite its remarkable results, unsupervised mapping is also well-known to be limited by the original dissimilarity between the word embedding spaces to be mapped. In this work, we propose a new approach that trains unsupervised BWE jointly on synthetic parallel data generated through unsupervised machine translation. We demonstrate that existing algorithms that jointly train BWE are very robust to noisy training data and show that unsupervised BWE jointly trained significantly outperform unsupervised mapped BWE in several cross-lingual NLP tasks.

pdf bib
NICT’s Supervised Neural Machine Translation Systems for the WMT19 News Translation Task
Raj Dabre | Kehai Chen | Benjamin Marie | Rui Wang | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper, we describe our supervised neural machine translation (NMT) systems that we developed for the news translation task for Kazakh↔English, Gujarati↔English, Chinese↔English, and English→Finnish translation directions. We focused on leveraging multilingual transfer learning and back-translation for the extremely low-resource language pairs: Kazakh↔English and Gujarati↔English translation. For the Chinese↔English translation, we used the provided parallel data augmented with a large quantity of back-translated monolingual data to train state-of-the-art NMT systems. We then employed techniques that have been proven to be most effective, such as back-translation, fine-tuning, and model ensembling, to generate the primary submissions of Chinese↔English. For English→Finnish, our submission from WMT18 remains a strong baseline despite the increase in parallel corpora for this year’s task.

pdf bib
NICT’s Unsupervised Neural and Statistical Machine Translation Systems for the WMT19 News Translation Task
Benjamin Marie | Haipeng Sun | Rui Wang | Kehai Chen | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the NICT’s participation in the WMT19 unsupervised news translation task. We participated in the unsupervised translation direction: German-Czech. Our primary submission to the task is the result of a simple combination of our unsupervised neural and statistical machine translation systems. Our system is ranked first for the German-to-Czech translation task, using only the data provided by the organizers (“constraint’”), according to both BLEU-cased and human evaluation. We also performed contrastive experiments with other language pairs, namely, English-Gujarati and English-Kazakh, to better assess the effectiveness of unsupervised machine translation in for distant language pairs and in truly low-resource conditions.

pdf bib
NICT’s Machine Translation Systems for the WMT19 Similar Language Translation Task
Benjamin Marie | Raj Dabre | Atsushi Fujita
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper presents the NICT’s participation in the WMT19 shared Similar Language Translation Task. We participated in the Spanish-Portuguese task. For both translation directions, we prepared state-of-the-art statistical (SMT) and neural (NMT) machine translation systems. Our NMT systems with the Transformer architecture were trained on the provided parallel data enlarged with a large quantity of back-translated monolingual data. Our primary submission to the task is the result of a simple combination of our SMT and NMT systems. According to BLEU, our systems were ranked second and third respectively for the Portuguese-to-Spanish and Spanish-to-Portuguese translation directions. For contrastive experiments, we also submitted outputs generated with an unsupervised SMT system.

pdf bib
Unsupervised Extraction of Partial Translations for Neural Machine Translation
Benjamin Marie | Atsushi Fujita
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In neural machine translation (NMT), monolingual data are usually exploited through a so-called back-translation: sentences in the target language are translated into the source language to synthesize new parallel data. While this method provides more training data to better model the target language, on the source side, it only exploits translations that the NMT system is already able to generate using a model trained on existing parallel data. In this work, we assume that new translation knowledge can be extracted from monolingual data, without relying at all on existing parallel data. We propose a new algorithm for extracting from monolingual data what we call partial translations: pairs of source and target sentences that contain sequences of tokens that are translations of each other. Our algorithm is fully unsupervised and takes only source and target monolingual data as input. Our empirical evaluation points out that our partial translations can be used in combination with back-translation to further improve NMT models. Furthermore, while partial translations are particularly useful for low-resource language pairs, they can also be successfully exploited in resource-rich scenarios to improve translation quality.

2018

pdf bib
A Smorgasbord of Features to Combine Phrase-Based and Neural Machine Translation
Benjamin Marie | Atsushi Fujita
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
NICT’s Neural and Statistical Machine Translation Systems for the WMT18 News Translation Task
Benjamin Marie | Rui Wang | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the NICT’s participation to the WMT18 shared news translation task. We participated in the eight translation directions of four language pairs: Estonian-English, Finnish-English, Turkish-English and Chinese-English. For each translation direction, we prepared state-of-the-art statistical (SMT) and neural (NMT) machine translation systems. Our NMT systems were trained with the transformer architecture using the provided parallel data enlarged with a large quantity of back-translated monolingual data that we generated with a new incremental training framework. Our primary submissions to the task are the result of a simple combination of our SMT and NMT systems. Our systems are ranked first for the Estonian-English and Finnish-English language pairs (constraint) according to BLEU-cased.

pdf bib
NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task
Rui Wang | Benjamin Marie | Masao Utiyama | Eiichiro Sumita
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the NICT’s participation in the WMT18 shared parallel corpus filtering task. The organizers provided 1 billion words German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pairs in the noisy data. Finally, we sampled 100 million and 10 million words and built corresponding NMT systems. Empirical results show that our NMT systems trained on sampled data achieve promising performance.

pdf bib
Combination of Statistical and Neural Machine Translation for Myanmar-English
Benjamin Marie | Atsushi Fujita | Eiichiro Sumita
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian Translation

2017

pdf bib
Efficient Extraction of Pseudo-Parallel Sentences from Raw Monolingual Data Using Word Embeddings
Benjamin Marie | Atsushi Fujita
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first exploits word embeddings in order to efficiently evaluate trillions of candidate sentence pairs and then a classifier to find the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.

pdf bib
Phrase Table Induction Using In-Domain Monolingual Data for Domain Adaptation in Statistical Machine Translation
Benjamin Marie | Atsushi Fujita
Transactions of the Association for Computational Linguistics, Volume 5

We present a new framework to induce an in-domain phrase table from in-domain monolingual data that can be used to adapt a general-domain statistical machine translation system to the targeted domain. Our method first compiles sets of phrases in source and target languages separately and generates candidate phrase pairs by taking the Cartesian product of the two phrase sets. It then computes inexpensive features for each candidate phrase pair and filters them using a supervised classifier in order to induce an in-domain phrase table. We experimented on the language pair English–French, both translation directions, in two domains and obtained consistently better results than a strong baseline system that uses an in-domain bilingual lexicon. We also conducted an error analysis that showed the induced phrase tables proposed useful translations, especially for words and phrases unseen in the parallel data used to train the general-domain baseline system.

2015

pdf bib
Touch-Based Pre-Post-Editing of Machine Translation Output
Benjamin Marie | Aurélien Max
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
METEOR-WSD: Improved Sense Matching in MT Evaluation
Marianna Apidianaki | Benjamin Marie
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
LIMSI@WMT’15 : Translation Task
Benjamin Marie | Alexandre Allauzen | Franck Burlot | Quoc-Khanh Do | Julia Ive | Elena Knyazeva | Matthieu Labeau | Thomas Lavergne | Kevin Löser | Nicolas Pécheux | François Yvon
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Alignment-based sense selection in METEOR and the RATATOUILLE recipe
Benjamin Marie | Marianna Apidianaki
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Multi-Pass Decoding With Complex Feature Guidance for Statistical Machine Translation
Benjamin Marie | Aurélien Max
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

pdf bib
LIMSI @ WMT’14 Medical Translation Task
Nicolas Pécheux | Li Gong | Quoc Khanh Do | Benjamin Marie | Yulia Ivanishcheva | Alexander Allauzen | Thomas Lavergne | Jan Niehues | Aurélien Max | François Yvon
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Confidence-based Rewriting of Machine Translation Output
Benjamin Marie | Aurélien Max
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)