Josep M. Crego

Also published as: Josep Crego, Josep Maria Crego


2020

Boosting Neural Machine Translation with Similar Translations
Jitao Xu | Josep Crego | Jean Senellart
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper explores data augmentation methods for training Neural Machine Translation to make use of similar translations, in a way comparable to how a human translator employs fuzzy matches. In particular, we show how we can simply present the neural model with information from both the source and target sides of the fuzzy matches; we also extend the similarity to include semantically related translations retrieved using distributed sentence representations. We show that translations based on fuzzy matching provide the model with “copy” information, while translations based on embedding similarities tend to extend the translation “context”. Results indicate that the effects of both kinds of similar sentences add up to further boost accuracy, combine naturally with model fine-tuning, and provide dynamic adaptation for unseen translation pairs. Tests on multiple data sets and domains show consistent accuracy improvements. To foster research around these techniques, we also release an open-source toolkit with an efficient and flexible fuzzy-match implementation.
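
As a rough illustration of the fuzzy-match augmentation idea summarized in this abstract, the sketch below retrieves the most similar source sentence from a small translation memory and appends its target side to the input. It is not the authors' toolkit: the function names, the `<fm>` separator, the 0.6 threshold, and the use of `difflib` as the similarity measure are all assumptions made for the example, and only the target side of the match is used.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Token-level similarity in [0, 1]; difflib's ratio stands in for a fuzzy-match score."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def augment(source, memory, threshold=0.6, sep="<fm>"):
    """Append the target side of the best fuzzy match to the source sentence.

    `memory` is a list of (source, target) pairs; `threshold` and `sep` are
    hypothetical choices for this sketch.
    """
    best_score, best_target = 0.0, None
    for tm_source, tm_target in memory:
        score = fuzzy_score(source, tm_source)
        if score > best_score:
            best_score, best_target = score, tm_target
    if best_target is None or best_score < threshold:
        return source  # no sufficiently similar translation: leave the input unchanged
    return f"{source} {sep} {best_target}"


memory = [
    ("how long does the flight last ?", "combien de temps dure le vol ?"),
    ("where is the station ?", "où est la gare ?"),
]
print(augment("how long does the trip last ?", memory))
```

A model trained on inputs of this shape can learn to copy from the appended translation when it is relevant and to ignore it otherwise.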

Efficient and High-Quality Neural Machine Translation with OpenNMT
Guillaume Klein | Dakun Zhang | Clément Chouteau | Josep Crego | Jean Senellart
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes the OpenNMT submissions to the WNGT 2020 efficiency shared task. We explore training and acceleration of Transformer models with various sizes that are trained in a teacher-student setup. We also present a custom and optimized C++ inference engine that enables fast CPU and GPU decoding with few dependencies. By combining additional optimizations and parallelization techniques, we create small, efficient, and high-quality neural machine translation models.

Integrating Domain Terminology into Neural Machine Translation
Elise Michon | Josep Crego | Jean Senellart
Proceedings of the 28th International Conference on Computational Linguistics

This paper extends existing work on terminology integration into Neural Machine Translation, a common industrial practice to dynamically adapt translation to a specific domain. Our method, based on the use of placeholders complemented with morphosyntactic annotation, efficiently taps into the ability of the neural network to deal with symbolic knowledge to surpass the surface generalization shown by alternative techniques. We compare our approach to state-of-the-art systems and benchmark them through a well-defined evaluation framework, focusing on actual application of terminology and not just on the overall performance. Results indicate the suitability of our method in the use-case where terminology is used in a system trained on generic data only.
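
A minimal sketch of the placeholder mechanism mentioned in this abstract, under simplifying assumptions: glossary entries are single tokens, the placeholder format `<termN>` is invented for the example, and the morphosyntactic annotation described by the authors is omitted. The translation step itself is not shown; the example only masks terms before translation and restores the target terms afterwards.

```python
def mask_terms(source, glossary):
    """Replace glossary source terms with numbered placeholders.

    Returns the masked sentence and a placeholder -> target-term mapping.
    Only single-token terms are handled in this sketch.
    """
    mapping = {}
    tokens = []
    for tok in source.split():
        if tok in glossary:
            placeholder = f"<term{len(mapping) + 1}>"
            mapping[placeholder] = glossary[tok]
            tokens.append(placeholder)
        else:
            tokens.append(tok)
    return " ".join(tokens), mapping


def unmask_terms(translation, mapping):
    """Substitute each placeholder in the translated output with its target term."""
    for placeholder, target in mapping.items():
        translation = translation.replace(placeholder, target)
    return translation


masked, mapping = mask_terms("restart the router now", {"router": "routeur"})
# A translation system is assumed to pass placeholders through unchanged.
print(unmask_terms("redémarrez le <term1> maintenant", mapping))
```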

2019

SYSTRAN @ WAT 2019: Russian-Japanese News Commentary task
Jitao Xu | TuAnh Nguyen | MinhQuang Pham | Josep Crego | Jean Senellart
Proceedings of the 6th Workshop on Asian Translation

This paper describes Systran’s submissions to the WAT 2019 Russian-Japanese News Commentary task, a challenging translation task due to the extremely low resources available and the distance between the two languages. We used the neural Transformer architecture trained on the provided resources, and we carried out synthetic data generation experiments aimed at alleviating the data scarcity problem. Results indicate the suitability of the data augmentation experiments, enabling our systems to rank first according to automatic evaluations.

Enhanced Transformer Model for Data-to-Text Generation
Li Gong | Josep Crego | Jean Senellart
Proceedings of the 3rd Workshop on Neural Generation and Translation

Neural models have recently shown significant progress on data-to-text generation tasks in which descriptive texts are generated conditioned on database records. In this work, we present a new Transformer-based data-to-text generation model which learns content selection and summary generation in an end-to-end fashion. We introduce two extensions to the baseline Transformer model: first, we modify the latent representation of the input, which helps to significantly improve the content correctness of the output summary; second, we include an additional learning objective that accounts for content selection modelling. In addition, we propose two data augmentation methods that succeed in further improving the performance of the resulting generation models. Evaluation experiments show that our final model outperforms current state-of-the-art systems as measured by different metrics: BLEU, content selection precision and content ordering. We have made the Transformer extension presented in this paper publicly available.

SYSTRAN @ WNGT 2019: DGT Task
Li Gong | Josep Crego | Jean Senellart
Proceedings of the 3rd Workshop on Neural Generation and Translation

This paper describes SYSTRAN's participation in the Document-level Generation and Translation (DGT) Shared Task of the 3rd Workshop on Neural Generation and Translation (WNGT 2019). We participate for the first time using a Transformer network enhanced with modified input embeddings and optimising an additional objective function that considers content selection. The network takes in structured data of basketball games and outputs a summary of the game in natural language.

2018

Fixing Translation Divergences in Parallel Corpora for Neural MT
MinhQuang Pham | Josep Crego | Jean Senellart | François Yvon
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.

OpenNMT System Description for WNMT 2018: 800 words/sec on a single-core CPU
Jean Senellart | Dakun Zhang | Bo Wang | Guillaume Klein | Jean-Pierre Ramatchandirin | Josep Crego | Alexander Rush
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

We present a system description of the OpenNMT Neural Machine Translation entry for the WNMT 2018 evaluation. In this work, we developed a heavily optimized NMT inference model targeting a high-performance CPU system. The final system uses a combination of four techniques, all of which lead to significant speed-ups in combination: (a) sequence distillation, (b) architecture modifications, (c) precomputation, particularly of vocabulary, and (d) CPU-targeted quantization. This work achieves the fastest performance of the shared task, and led to the development of new features that have been integrated into OpenNMT and made available to the community.

Neural Network Architectures for Arabic Dialect Identification
Elise Michon | Minh Quang Pham | Josep Crego | Jean Senellart
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

SYSTRAN competes this year for the first time in the DSL shared task, in the Arabic Dialect Identification subtask. We participate by training several neural network models, showing that we can obtain competitive results despite the limited amount of training data available for learning. We report our experiments and detail the network architecture and parameters of our 3 runs: our best-performing system consists of a Multi-Input CNN that learns separate embeddings for lexical, phonetic and acoustic input features (F1: 0.5289); we also built a CNN-biLSTM network aimed at capturing both spatial and sequential features directly from speech spectrograms (F1: 0.3894 at submission time, F1: 0.4235 with parameters found later); and finally a system relying on binary CNN-biLSTMs (F1: 0.4339).

SYSTRAN Participation to the WMT2018 Shared Task on Parallel Corpus Filtering
MinhQuang Pham | Josep Crego | Jean Senellart
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the participation of SYSTRAN in the shared task on parallel corpus filtering at the Third Conference on Machine Translation (WMT 2018). We participate for the first time using a neural sentence similarity classifier which aims at predicting the relatedness of sentence pairs in a multilingual context. The paper describes the main characteristics of our approach and discusses the results obtained on the data sets published for the shared task.

2017

Adaptation incrémentale de modèles de traduction neuronaux (Incremental adaptation of neural machine translation models)
Christophe Servan | Josep Crego | Jean Senellart
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

Domain adaptation is a scientific challenge in machine translation. It generally covers the adaptation of terminology and style, in particular for human post-editing in computer-assisted translation. With neural machine translation, we study a new approach to domain adaptation that we call “specialization”, which shows promising results in both training speed and translation scores. In this paper, we propose to explore this approach.

Conception d’une solution de détection d’événements basée sur Twitter (Design of a solution for event detection from Tweeter)
Christophe Servan | Catherine Kobus | Yongchao Deng | Cyril Touffet | Jungi Kim | Inès Kapp | Djamel Mostefa | Josep Crego | Aurélien Coquard | Jean Senellart
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations

This paper presents an alert system built on the mass of data coming from Twitter. The aim of the tool is to monitor the news across several pilot domains, including sports events and natural disasters. The monitoring results are delivered to the user through a web interface showing the list of events located on a map.

Boosting Neural Machine Translation
Dakun Zhang | Jungi Kim | Josep Crego | Jean Senellart
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Training efficiency is one of the main problems for Neural Machine Translation (NMT). Deep networks need very large amounts of data as well as many training iterations to achieve state-of-the-art performance. This results in very high computation cost, slowing down research and industrialisation. In this paper, we propose to alleviate this problem with several training methods based on data boosting and bootstrapping, with no modifications to the neural network. These methods imitate the learning process of humans, who typically spend more time learning “difficult” concepts than easier ones. We experiment on an English-French translation task, showing accuracy improvements of up to 1.63 BLEU while saving 20% of training time.

Domain Control for Neural Machine Translation
Catherine Kobus | Josep Crego | Jean Senellart
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Machine translation systems are very sensitive to the domains they were trained on. Several domain adaptation techniques have already been studied in depth. We propose a new technique for neural machine translation (NMT) that we call domain control, which is performed at runtime using a single neural network covering multiple domains. The presented approach shows quality improvements when compared to dedicated domain systems, both when translating on any of the covered domains and even on out-of-domain data. In addition, model parameters do not need to be re-estimated for each domain, making this approach effective for real use cases. Evaluation is carried out on English-to-French translation for two different testing scenarios. We first consider the case where an end-user performs translations on a known domain. Secondly, we consider the scenario where the domain is not known and is predicted at the sentence level before translating. Results show consistent accuracy improvements for both conditions.
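
The two testing scenarios in this abstract can be sketched as follows, under stated assumptions: the domain is passed to the network as an extra token prepended to the source sentence, and an unknown domain is predicted per sentence. The tag format `@domain@`, the function names, and the keyword-overlap predictor are hypothetical stand-ins chosen for the example, not the mechanism or classifier used by the authors.

```python
def tag_source(sentence, domain):
    """Prepend a domain tag so a single multi-domain model can condition on it."""
    return f"@{domain}@ {sentence}"


def predict_domain(sentence, keywords, default="generic"):
    """Toy sentence-level domain predictor based on keyword overlap.

    `keywords` maps each domain name to a set of lowercase indicator words.
    """
    tokens = set(sentence.lower().split())
    scores = {domain: len(tokens & words) for domain, words in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


keywords = {"medical": {"patient", "dose"}, "legal": {"court", "contract"}}
sentence = "The patient received a dose"

# Scenario 1: the domain is known in advance.
print(tag_source(sentence, "medical"))
# Scenario 2: the domain is predicted before translating.
print(tag_source(sentence, predict_domain(sentence, keywords)))
```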

SYSTRAN Purely Neural MT Engines for WMT2017
Yongchao Deng | Jungi Kim | Guillaume Klein | Catherine Kobus | Natalia Segal | Christophe Servan | Bo Wang | Dakun Zhang | Josep Crego | Jean Senellart
Proceedings of the Second Conference on Machine Translation

2012

Joint WMT 2012 Submission of the QUAERO Project
Markus Freitag | Stephan Peitz | Matthias Huck | Hermann Ney | Jan Niehues | Teresa Herrmann | Alex Waibel | Hai-son Le | Thomas Lavergne | Alexandre Allauzen | Bianka Buschbeck | Josep Maria Crego | Jean Senellart
Proceedings of the Seventh Workshop on Statistical Machine Translation

2011

LIMSI @ WMT11
Alexandre Allauzen | Hélène Bonneau-Maynard | Hai-Son Le | Aurélien Max | Guillaume Wisniewski | François Yvon | Gilles Adda | Josep Maria Crego | Adrien Lardilleux | Thomas Lavergne | Artem Sokolov
Proceedings of the Sixth Workshop on Statistical Machine Translation

Joint WMT Submission of the QUAERO Project
Markus Freitag | Gregor Leusch | Joern Wuebker | Stephan Peitz | Hermann Ney | Teresa Herrmann | Jan Niehues | Alex Waibel | Alexandre Allauzen | Gilles Adda | Josep Maria Crego | Bianka Buschbeck | Tonio Wandmacher | Jean Senellart
Proceedings of the Sixth Workshop on Statistical Machine Translation

From n-gram-based to CRF-based Translation Models
Thomas Lavergne | Alexandre Allauzen | Josep Maria Crego | François Yvon
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

LIMSI’s Statistical Translation Systems for WMT’10
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | François Yvon
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Local lexical adaptation in Machine Translation through triangulation: SMT helping SMT
Josep Maria Crego | Aurélien Max | François Yvon
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Improving Reordering with Linguistically Informed Bilingual n-grams
Josep Maria Crego | François Yvon
Coling 2010: Posters

Contrastive Lexical Evaluation of Machine Translation
Aurélien Max | Josep Maria Crego | François Yvon
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper advocates a complementary measure of translation performance that focuses on the contrastive ability of two or more systems or system versions to adequately translate source words. This is motivated by three main reasons: 1) existing automatic metrics sometimes do not show significant differences that can be revealed by fine-grained focussed human evaluation, 2) these metrics are based on direct comparisons between system hypotheses and the corresponding reference translations, thus ignoring the input words that were actually translated, and 3) as these metrics do not take input hypotheses from several systems at once, fine-grained contrastive evaluation can only be done indirectly. This proposal is illustrated on a multi-source Machine Translation scenario where multiple translations of a source text are available. Significant gains (up to +1.3 BLEU points) are achieved in these experiments, and contrastive lexical evaluation is shown to provide new information that can help to better analyse a system's performance.

2009

LIMSI’s Statistical Translation Systems for WMT’09
Alexandre Allauzen | Josep Crego | Aurélien Max | François Yvon
Proceedings of the Fourth Workshop on Statistical Machine Translation

Gappy Translation Units under Left-to-Right SMT Decoding
Josep M. Crego | François Yvon
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2008

Using Shallow Syntax Information to Improve Word Alignment and Reordering for SMT
Josep M. Crego | Nizar Habash
Proceedings of the Third Workshop on Statistical Machine Translation

The TALP-UPC Ngram-Based Statistical Machine Translation System for ACL-WMT 2008
Maxim Khalilov | Adolfo Hernández H. | Marta R. Costa-jussà | Josep M. Crego | Carlos A. Henríquez Q. | Patrik Lambert | José A. R. Fonollosa | José B. Mariño | Rafael E. Banchs
Proceedings of the Third Workshop on Statistical Machine Translation

2007

Discriminative Alignment Training without Annotated Data for Machine Translation
Patrik Lambert | Rafael E. Banchs | Josep M. Crego
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Analysis and System Combination of Phrase- and N-Gram-Based Statistical Machine Translation Systems
Marta R. Costa-jussà | Josep M. Crego | David Vilar | José A. R. Fonollosa | José B. Mariño | Hermann Ney
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Ngram-Based Statistical Machine Translation Enhanced with Multiple Weighted Reordering Hypotheses
Marta R. Costa-jussà | Josep M. Crego | Patrik Lambert | Maxim Khalilov | José A. R. Fonollosa | José B. Mariño | Rafael E. Banchs
Proceedings of the Second Workshop on Statistical Machine Translation

Extending MARIE: an N-gram-based SMT decoder
Josep M. Crego | José B. Mariño
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

N-gram-based Machine Translation
José Mariño | Rafael E. Banchs | Josep M. Crego | Adrià de Gispert | Patrik Lambert | José A. R. Fonollosa | Marta R. Costa-jussà
Computational Linguistics, Volume 32, Number 4, December 2006

TALP Phrase-based statistical translation system for European language pairs
Marta R. Costa-jussà | Josep M. Crego | Adrià de Gispert | Patrik Lambert | Maxim Khalilov | José B. Mariño | José A. R. Fonollosa | Rafael Banchs
Proceedings on the Workshop on Statistical Machine Translation

N-gram-based SMT System Enhanced with Reordering Patterns
Josep M. Crego | Adrià de Gispert | Patrik Lambert | Marta R. Costa-jussà | Maxim Khalilov | Rafael Banchs | José B. Mariño | José A. R. Fonollosa
Proceedings on the Workshop on Statistical Machine Translation

2005

Statistical Machine Translation of Euparl Data by using Bilingual N-grams
Rafael E. Banchs | Josep M. Crego | Adrià de Gispert | Patrik Lambert | José B. Mariño
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

Bilingual Connections for Trilingual Corpora: An XML Approach
Victoria Arranz | Núria Castell | Josep Maria Crego | Jesús Giménez | Adrià de Gispert | Patrik Lambert
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)