Holger Schwenk


2020

MLQA: Evaluating Cross-lingual Extractive Question Answering
Patrick Lewis | Barlas Oguz | Ruty Rinott | Sebastian Riedel | Holger Schwenk
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Question answering (QA) models have shown rapid progress enabled by the availability of large, high-quality benchmark datasets. Such annotated datasets are difficult and costly to collect, and rarely exist in languages other than English, which makes it challenging to build QA systems that work well in other languages. In order to develop such systems, it is crucial to invest in high-quality multilingual evaluation benchmarks to measure progress. We present MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area. MLQA contains QA instances in 7 languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA has over 12K instances in English and 5K in each of the other languages, with each instance parallel between 4 languages on average. We evaluate state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA. In all cases, transfer results are shown to be significantly behind training-language performance.
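
Since MLQA is an extractive benchmark, systems are scored by comparing each predicted answer span against the gold span. Below is a minimal sketch of SQuAD-style exact-match and token-F1 scoring; the whitespace tokenisation and function names are simplifications of what the official evaluation script does (which additionally handles language-specific tokenisation and punctuation stripping).

```python
import collections

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the predicted span equals the gold span exactly, else 0.0."""
    return float(prediction.strip() == gold.strip())

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span.

    Whitespace tokenisation is a simplification of the official MLQA script.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```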

2019

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
Mikel Artetxe | Holger Schwenk
Transactions of the Association for Computational Linguistics, Volume 7

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI data set), cross-lingual document classification (MLDoc data set), and parallel corpus mining (BUCC data set) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder, and the multilingual test set are available at https://github.com/facebookresearch/LASER.
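
The zero-shot recipe described here is straightforward to reproduce once the embeddings are available: train a task classifier on English embeddings only and apply it unchanged to any other language. The sketch below is a minimal illustration; the scikit-learn classifier is an arbitrary choice, and `embed` in the commented usage stands in for the encoder released in the linked repository.

```python
from sklearn.linear_model import LogisticRegression

def train_zero_shot_classifier(train_embeddings, train_labels):
    """Fit a task classifier on top of fixed multilingual sentence embeddings.

    Because all 93 languages share one embedding space, the returned classifier
    can be applied directly to embeddings of any other language.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embeddings, train_labels)
    return clf

# Hypothetical usage, with `embed` standing in for the encoder from
# https://github.com/facebookresearch/LASER:
# clf = train_zero_shot_classifier(embed(english_train_texts), english_train_labels)
# predictions = clf.predict(embed(swahili_test_texts))   # no Swahili training data used
```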

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
Mikel Artetxe | Holger Schwenk
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
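
The key idea is to replace an absolute cosine threshold with a score that is relative to each sentence's nearest-neighbour neighbourhood. The sketch below implements the ratio form of the margin as I understand it from the paper, using brute-force NumPy similarities; for corpora of realistic size an approximate nearest-neighbour index would be used instead.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin score between every source and target sentence embedding.

    The cosine similarity of a pair is divided by the average similarity of
    each sentence to its k nearest neighbours, so that "hub" sentences that
    are close to everything are penalised.
    """
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                     # cosine similarities
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # avg sim of each source to its k NNs
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # avg sim of each target to its k NNs
    return sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2.0)

# Candidate pairs are then ranked by this score; a pair is kept when its margin
# exceeds a threshold tuned on held-out data.
```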

Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
Vishrav Chaudhary | Yuqing Tang | Francisco Guzmán | Holger Schwenk | Philipp Koehn
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, compared to the second-best systems. Moreover, our experiments show that this technique is promising for low-resource and even no-resource scenarios.
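
Concretely, scoring without a trained scoring function can be as simple as taking the cosine similarity of the two sides of each candidate pair in the multilingual embedding space. The sketch below is a minimal version of that idea; `embed` in the commented usage is a placeholder for a LASER-style encoder, and the output format is simply one score per pair.

```python
import numpy as np

def pair_scores(src_embeddings, tgt_embeddings):
    """Cosine similarity between aligned rows of two embedding matrices,
    used directly as a cleanness score for each candidate sentence pair."""
    src = src_embeddings / np.linalg.norm(src_embeddings, axis=1, keepdims=True)
    tgt = tgt_embeddings / np.linalg.norm(tgt_embeddings, axis=1, keepdims=True)
    return (src * tgt).sum(axis=1)

# Hypothetical usage, one score per candidate pair for subsampling:
# scores = pair_scores(embed(nepali_side), embed(english_side))
# np.savetxt("scores.txt", scores)
```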

2018

A Corpus for Multilingual Document Classification in Eight Languages
Holger Schwenk | Xian Li
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

XNLI: Evaluating Cross-lingual Sentence Representations
Alexis Conneau | Ruty Rinott | Guillaume Lample | Adina Williams | Samuel Bowman | Holger Schwenk | Veselin Stoyanov
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 14 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
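
Of the baselines mentioned, the translate-test one is the simplest to sketch: translate each foreign premise/hypothesis pair into English and classify it with an English-only NLI model. Both helpers in the snippet below are hypothetical placeholders for whatever MT system and English NLI classifier are available; they are not part of the paper's released code.

```python
def translate_test(premise, hypothesis, src_lang, translate, english_nli):
    """TRANSLATE-TEST baseline: translate a foreign test pair into English and
    classify it with an English NLI model.

    `translate` and `english_nli` are placeholders: any MT system and any
    English classifier returning entailment/neutral/contradiction will do.
    """
    premise_en = translate(premise, src=src_lang, tgt="en")
    hypothesis_en = translate(hypothesis, src=src_lang, tgt="en")
    return english_nli(premise_en, hypothesis_en)
```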

Filtering and Mining Parallel Data in a Joint Multilingual Space
Holger Schwenk
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. We are able to improve a competitive baseline on the WMT’14 English to German task by 0.3 BLEU by filtering out 25% of the training data. The same approach is used to mine additional bitexts for the WMT’14 system and to obtain competitive results on the BUCC shared task to identify parallel sentences in comparable corpora. The approach is generic: it can be applied to many language pairs and is independent of the architecture of the machine translation system.
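
Filtering by distance in the joint space then amounts to ranking the training pairs by similarity and discarding the lowest-scoring fraction (25% in the WMT’14 experiment reported here). A minimal sketch, where `pairs` and `scores` are placeholders for the bitext and its per-pair similarity scores:

```python
import numpy as np

def filter_bitext(pairs, scores, drop_fraction=0.25):
    """Keep only the sentence pairs whose joint-space similarity is above the
    `drop_fraction` quantile, i.e. discard the noisiest fraction of the bitext."""
    cutoff = np.quantile(scores, drop_fraction)
    return [pair for pair, score in zip(pairs, scores) if score >= cutoff]
```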

2017

Very Deep Convolutional Networks for Text Classification
Alexis Conneau | Holger Schwenk | Loïc Barrault | Yann Lecun
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The dominant approaches for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which have pushed the state of the art in computer vision. We present a new architecture (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations. We are able to show that the performance of this model increases with depth: using up to 29 convolutional layers, we report improvements over the state of the art on several public text classification tasks. To the best of our knowledge, this is the first time that very deep convolutional nets have been applied to text processing.
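
The building block is deliberately simple: stacks of kernel-size-3 temporal convolutions with batch normalisation, interleaved with pooling that halves the sequence length. The PyTorch sketch below shows the character front end and two such blocks; the vocabulary size, embedding dimension and feature-map widths reflect my reading of the paper's configuration rather than an exact reimplementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two kernel-size-3 temporal convolutions with batch norm and ReLU; stacking
    these blocks with pooling in between yields the deeper (up to 29-layer) variants."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):            # x: (batch, channels, characters)
        return self.net(x)

char_embedding = nn.Embedding(num_embeddings=70, embedding_dim=16)   # character lookup table
backbone = nn.Sequential(
    nn.Conv1d(16, 64, kernel_size=3, padding=1),       # stem over character embeddings
    ConvBlock(64, 64),
    nn.MaxPool1d(kernel_size=3, stride=2, padding=1),  # halve the temporal resolution...
    ConvBlock(64, 128),                                # ...and double the feature maps
)

chars = torch.randint(0, 70, (8, 256))                 # batch of 8 texts, 256 characters each
features = backbone(char_embedding(chars).transpose(1, 2))
```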

Learning Joint Multilingual Sentence Representations with Neural Machine Translation
Holger Schwenk | Matthijs Douze
Proceedings of the 2nd Workshop on Representation Learning for NLP

In this paper, we use the framework of neural machine translation to learn joint sentence representations across six very different languages. Our aim is a representation that is independent of the language and is therefore likely to capture the underlying semantics. We define a new cross-lingual similarity measure, compare up to 1.4M sentence representations and study the characteristics of close sentences. We provide experimental evidence that sentences that are close in embedding space are indeed semantically highly related, but often have quite different structure and syntax. These relations also hold when comparing sentences in different languages.
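
Comparing on the order of a million sentence representations calls for an efficient similarity index. The sketch below uses FAISS with an exact inner-product index over L2-normalised vectors, which is equivalent to cosine search; the choice of FAISS and the helper names are mine, not taken from the paper.

```python
import faiss
import numpy as np

def cosine_index(embeddings):
    """Build an exact inner-product index over L2-normalised embeddings,
    which makes the search equivalent to cosine-similarity search."""
    emb = np.array(embeddings, dtype="float32")        # FAISS expects float32
    faiss.normalize_L2(emb)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return index

def closest_sentences(index, query_embeddings, k=5):
    """Return the similarities and row ids of the k closest indexed sentences."""
    queries = np.array(query_embeddings, dtype="float32")
    faiss.normalize_L2(queries)
    return index.search(queries, k)
```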

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau | Douwe Kiela | Holger Schwenk | Loïc Barrault | Antoine Bordes
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have not been as successful: several attempts at learning unsupervised representations of sentences have not reached performance satisfactory enough to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.
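
At NLI-training time, the premise and hypothesis embeddings produced by the encoder are combined into a single feature vector before the classifier. The sketch below uses the pair-combination scheme standard in this line of work (concatenation, absolute difference, element-wise product); the embedding dimension and layer sizes are placeholders rather than the paper's exact configuration, and `encode` stands in for the SNLI-trained sentence encoder.

```python
import torch
import torch.nn as nn

def pair_features(u, v):
    """Combine premise and hypothesis embeddings into one feature vector using
    concatenation, absolute difference and element-wise product."""
    return torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)

embedding_dim = 512                                    # placeholder size
nli_head = nn.Sequential(                              # 3-way entailment classifier
    nn.Linear(4 * embedding_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 3),
)

# Hypothetical usage, with `encode` standing in for the SNLI-trained encoder:
# logits = nli_head(pair_features(encode(premises), encode(hypotheses)))
```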

2015

Incremental Adaptation Strategies for Neural Network Language Models
Alex Ter-Sarkisov | Holger Schwenk | Fethi Bougares | Loïc Barrault
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality

Continuous Adaptation to User Feedback for Statistical Machine Translation
Frédéric Blain | Fethi Bougares | Amir Hazem | Loïc Barrault | Holger Schwenk
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

The MateCat Tool
Marcello Federico | Nicola Bertoldi | Mauro Cettolo | Matteo Negri | Marco Turchi | Marco Trombetti | Alessandro Cattelan | Antonio Farina | Domenico Lupinetti | Andrea Martines | Alberto Massidda | Holger Schwenk | Loïc Barrault | Frédéric Blain | Philipp Koehn | Christian Buck | Ulrich Germann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Kyunghyun Cho | Bart van Merriënboer | Caglar Gulcehre | Dzmitry Bahdanau | Fethi Bougares | Holger Schwenk | Yoshua Bengio
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction
Haithem Afli | Loïc Barrault | Holger Schwenk
Proceedings of the Sixth International Joint Conference on Natural Language Processing

A Multi-Domain Translation Model Framework for Statistical Machine Translation
Rico Sennrich | Holger Schwenk | Walid Aransa
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Traduction automatique à partir de corpus comparables: extraction de phrases parallèles à partir de données comparables multimodales (Automatic Translation from Comparable Corpora: Extracting Parallel Sentences from Multimodal Comparable Corpora) [in French]
Haithem Afli | Loïc Barrault | Holger Schwenk
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

Automatic Translation of Scientific Documents in the HAL Archive
Patrik Lambert | Holger Schwenk | Frédéric Blain
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the development of a statistical machine translation system between French and English for scientific papers. This system will be closely integrated into the French HAL open archive, a collection of more than 100,000 scientific papers. We describe the creation of in-domain parallel and monolingual corpora, the development of a domain-specific translation system with the created resources, and its adaptation using monolingual resources only. These techniques allowed us to improve a generic system by more than 10 BLEU points.

Large, Pruned or Continuous Space Language Models on a GPU for Statistical Machine Translation
Holger Schwenk | Anthony Rousseau | Mohammed Attik
Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT

LIUM’s SMT Machine Translation Systems for WMT 2012
Christophe Servan | Patrik Lambert | Anthony Rousseau | Holger Schwenk | Loïc Barrault
Proceedings of the Seventh Workshop on Statistical Machine Translation

Collaborative Machine Translation Service for Scientific texts
Patrik Lambert | Jean Senellart | Laurent Romary | Holger Schwenk | Florian Zipser | Patrice Lopez | Frédéric Blain
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

Continuous Space Translation Models for Phrase-Based Statistical Machine Translation
Holger Schwenk
Proceedings of COLING 2012: Posters

2011

Investigations on Translation Model Adaptation Using Monolingual Data
Patrik Lambert | Holger Schwenk | Christophe Servan | Sadaf Abdul-Rauf
Proceedings of the Sixth Workshop on Statistical Machine Translation

LIUM’s SMT Machine Translation Systems for WMT 2011
Holger Schwenk | Patrik Lambert | Loïc Barrault | Christophe Servan | Sadaf Abdul-Rauf | Haithem Afli | Kashif Shah
Proceedings of the Sixth Workshop on Statistical Machine Translation

Parametric Weighting of Parallel Data for Statistical Machine Translation
Kashif Shah | Loïc Barrault | Holger Schwenk
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

LIUM SMT Machine Translation System for WMT 2010
Patrik Lambert | Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Translation Model Adaptation by Resampling
Kashif Shah | Loïc Barrault | Holger Schwenk
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

SMT and SPE Machine Translation Systems for WMT’09
Holger Schwenk | Sadaf Abdul-Rauf | Loïc Barrault | Jean Senellart
Proceedings of the Fourth Workshop on Statistical Machine Translation

Exploiting Comparable Corpora with TER and TERp
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

On the Use of Comparable Corpora to Improve SMT performance
Sadaf Abdul-Rauf | Holger Schwenk
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Large and Diverse Language Models for Statistical Machine Translation
Holger Schwenk | Philipp Koehn
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

First Steps towards a General Purpose French/English Statistical Machine Translation System
Holger Schwenk | Jean-Baptiste Fouet | Jean Senellart
Proceedings of the Third Workshop on Statistical Machine Translation

2007

Combining Morphosyntactic Enriched Representation with n-best Reranking in Statistical Translation
Hélène Bonneau-Maynard | Alexandre Allauzen | Daniel Déchelotte | Holger Schwenk
Proceedings of SSST, NAACL-HLT 2007 / AMTA Workshop on Syntax and Structure in Statistical Translation

Building a Statistical Machine Translation System for French Using the Europarl Corpus
Holger Schwenk
Proceedings of the Second Workshop on Statistical Machine Translation

Smooth Bilingual N-Gram Translation
Holger Schwenk | Marta R. Costa-jussà | Jose A. R. Fonollosa
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

Continuous Space Language Models for Statistical Machine Translation
Holger Schwenk | Daniel Déchelotte | Jean-Luc Gauvain
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

Training Neural Network Language Models on Very Large Corpora
Holger Schwenk | Jean-Luc Gauvain
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing