Alexis Conneau


Emerging Cross-lingual Structure in Pretrained Language Models
Alexis Conneau | Shijie Wu | Haoran Li | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We study the problem of multilingual masked language modeling, i.e. the training of a single model on concatenated text from multiple languages, and present a detailed study of several factors that influence why these models are so effective for cross-lingual transfer. We show, contrary to what was previously hypothesized, that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains. The only requirement is that there are some shared parameters in the top layers of the multilingual encoder. To better understand this result, we also show that representations from monolingual BERT models in different languages can be aligned post-hoc quite effectively, strongly suggesting that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces. For multilingual masked language modeling, these symmetries are automatically discovered and aligned during the joint training process.

Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau | Kartikay Khandelwal | Naman Goyal | Vishrav Chaudhary | Guillaume Wenzek | Francisco Guzmán | Edouard Grave | Myle Ott | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek | Marie-Anne Lachaux | Alexis Conneau | Vishrav Chaudhary | Francisco Guzmán | Armand Joulin | Edouard Grave
Proceedings of the 12th Language Resources and Evaluation Conference

Pre-training text representations has led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as their quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step that selects documents close to high-quality corpora like Wikipedia.
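
As a rough illustration of the deduplication step mentioned in the abstract above, the sketch below drops documents whose normalized text hashes to an already-seen value. The function names (`normalize`, `fingerprint`, `deduplicate`), the normalization rules, and the choice of SHA-1 are assumptions made for illustration; they are not taken from the CCNet code base, and the sketch does not cover language identification or quality filtering.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Hypothetical normalization: Unicode NFKC, lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text):
    """Hash the normalized text so near-identical copies share one key."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep the first occurrence of each distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Illustrative input: the first two strings differ only in case and spacing.
docs = ["Hello  world.", "hello world.", "Bonjour le monde."]
unique_docs = deduplicate(docs)
```

Storing only fixed-size hashes rather than full documents keeps memory use bounded by the number of distinct documents, which is one plausible reason to prefer this design on web-scale data.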


Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Isabelle Augenstein | Spandana Gella | Sebastian Ruder | Katharina Kann | Burcu Can | Johannes Welbl | Alexis Conneau | Xiang Ren | Marek Rei
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)


SentEval: An Evaluation Toolkit for Universal Sentence Representations
Alexis Conneau | Douwe Kiela
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Learning Visually Grounded Sentence Representations
Douwe Kiela | Alexis Conneau | Allan Jabri | Maximilian Nickel
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We investigate grounded sentence representations, where we train a sentence encoder to predict the image features of a given caption—i.e., we try to “imagine” how a sentence would be depicted visually—and use the resultant features as sentence representations. We examine the quality of the learned representations on a variety of standard sentence representation quality benchmarks, showing improved performance for grounded models over non-grounded ones. In addition, we thoroughly analyze the extent to which grounding contributes to improved performance, and show that the system also learns improved word embeddings.

XNLI: Evaluating Cross-lingual Sentence Representations
Alexis Conneau | Ruty Rinott | Guillaume Lample | Adina Williams | Samuel Bowman | Holger Schwenk | Veselin Stoyanov
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 14 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.

Phrase-Based & Neural Unsupervised Machine Translation
Guillaume Lample | Myle Ott | Alexis Conneau | Ludovic Denoyer | Marc’Aurelio Ranzato
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT’14 English-French and WMT’16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.

What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties
Alexis Conneau | German Kruszewski | Guillaume Lample | Loïc Barrault | Marco Baroni
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. “Downstream” tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. However, the complexity of these tasks makes it difficult to infer what kind of information is present in the representations. We introduce 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.


Very Deep Convolutional Networks for Text Classification
Alexis Conneau | Holger Schwenk | Loïc Barrault | Yann Lecun
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The dominant approaches for many NLP tasks are recurrent neural networks, in particular LSTMs, and convolutional neural networks. However, these architectures are rather shallow in comparison to the deep convolutional networks which have pushed the state of the art in computer vision. We present a new architecture (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations. We show that the performance of this model increases with depth: using up to 29 convolutional layers, we report improvements over the state of the art on several public text classification tasks. To the best of our knowledge, this is the first time that very deep convolutional nets have been applied to text processing.

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau | Douwe Kiela | Holger Schwenk | Loïc Barrault | Antoine Bordes
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have, however, been less successful. Several attempts at learning unsupervised representations of sentences have not reached performance satisfactory enough to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.