Myle Ott


Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art
Patrick Lewis | Myle Ott | Jingfei Du | Veselin Stoyanov
Proceedings of the 3rd Clinical Natural Language Processing Workshop

A large array of pretrained models are available to the biomedical NLP (BioNLP) community. Finding the best model for a particular task can be difficult and time-consuming. For many applications in the biomedical and clinical domains, it is crucial that models can be built quickly and are highly accurate. We present a large-scale study across 18 established biomedical and clinical NLP tasks to determine which of several popular open-source biomedical and clinical NLP models work well in different settings. Furthermore, we apply recent advances in pretraining to train new biomedical language models, and carefully investigate the effect of various design choices on downstream performance. Our best models perform well in all of our benchmarks, and set new State-of-the-Art in 9 tasks. We release these models in the hope that they can help the community to speed up and increase the accuracy of BioNLP and text mining applications.

On The Evaluation of Machine Translation Systems Trained With Back-Translation
Sergey Edunov | Myle Ott | Marc’Aurelio Ranzato | Michael Auli
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Back-translation is a widely used data augmentation technique which leverages target monolingual data. However, its effectiveness has been challenged since automatic metrics such as BLEU only show significant improvements for test examples where the source itself is a translation, or translationese. This is believed to be due to translationese inputs better matching the back-translated training data. In this work, we show that this conjecture is not empirically supported and that back-translation improves translation quality of both naturally occurring text as well as translationese according to professional human translators. We provide empirical evidence to support the view that back-translation is preferred by humans because it produces more fluent outputs. BLEU cannot capture human preferences because references are translationese when source sentences are natural text. We recommend complementing BLEU with a language model score to measure fluency.

Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau | Kartikay Khandelwal | Naman Goyal | Vishrav Chaudhary | Guillaume Wenzek | Francisco Guzmán | Edouard Grave | Myle Ott | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

How Decoding Strategies Affect the Verifiability of Generated Text
Luca Massarelli | Fabio Petroni | Aleksandra Piktus | Myle Ott | Tim Rocktäschel | Vassilis Plachouras | Fabrizio Silvestri | Sebastian Riedel
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent progress in pre-trained language models has led to systems that are able to generate text of increasingly high quality. While several works have investigated the fluency and grammatical correctness of such models, it is still unclear to what extent the generated text is consistent with factual world knowledge. Here, we go beyond fluency and also investigate the verifiability of text generated by state-of-the-art pre-trained language models. A generated sentence is verifiable if it can be corroborated or disproved by Wikipedia, and we find that the verifiability of generated text strongly depends on the decoding strategy. In particular, we discover a tradeoff between factuality (i.e., the ability to generate Wikipedia-corroborated text) and repetitiveness. While decoding strategies such as top-k and nucleus sampling lead to less repetitive generations, they also produce less verifiable text. Based on these findings, we introduce a simple and effective decoding strategy which, in comparison to previously used decoding strategies, produces less repetitive and more verifiable text.
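
For readers unfamiliar with the two sampling strategies named in this abstract, the following is a minimal sketch of top-k and nucleus (top-p) filtering over a single next-token distribution. It is not the decoding strategy the paper introduces; the function names and the toy distribution are assumptions made purely for illustration.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their mass."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the smallest set of most probable token ids whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng):
    """Draw one token id from a filtered, renormalized distribution."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution over a 4-token vocabulary.
toy_probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
token_from_top_k = sample(top_k_filter(toy_probs, 2), rng)
token_from_nucleus = sample(nucleus_filter(toy_probs, 0.9), rng)
```

Both filters truncate the low-probability tail before sampling; they differ in whether the cutoff is a fixed token count (k) or a fixed probability mass (p).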

General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference
Jingfei Du | Myle Ott | Haoran Li | Xing Zhou | Veselin Stoyanov
Findings of the Association for Computational Linguistics: EMNLP 2020

The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We aim to reduce the inference cost in a setting where many different predictions are made on a single piece of text. In that case, computational cost during inference can be amortized over the different predictions (tasks) using a shared text encoder. We compare approaches for training such an encoder and show that encoders pre-trained over multiple tasks generalize well to unseen tasks. We also compare ways of extracting fixed- and limited-size representations from this encoder, including pooling features extracted from multiple layers or positions. Our best approach compares favorably to knowledge distillation, achieving higher accuracy and lower computational cost once the system is handling around 7 tasks. Further, we show that through binary quantization, we can reduce the size of the extracted representations by a factor of 16 to store them for later use. The resulting method offers a compelling solution for using large-scale pre-trained models at a fraction of the computational cost when multiple tasks are performed on the same text.
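
As a rough illustration of how binary quantization can shrink stored representations, the sketch below keeps only the sign of each dimension and packs the resulting bits into bytes. The thresholding rule, the function names, and the assumption that the original vectors are stored as 16-bit floats (which would give the factor of 16) are illustrative assumptions, not details confirmed by the abstract.

```python
def binarize(vec):
    """Quantize each dimension to one bit (1 if positive, else 0) and pack into bytes."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits.to_bytes((len(vec) + 7) // 8, "little")

def hamming(a, b):
    """Count differing bits between two packed codes of equal length."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Hypothetical 1024-dimensional embedding.
embedding = [0.1 if i % 3 else -0.2 for i in range(1024)]
code = binarize(embedding)

# Assuming 2 bytes per dimension (float16) for the original vector.
original_bytes = len(embedding) * 2
compression_ratio = original_bytes / len(code)
```

Under the float16 assumption, a 1024-dimensional vector takes 2048 bytes and its binary code takes 128 bytes, and packed codes can still be compared cheaply with Hamming distance.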


The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
Francisco Guzmán | Peng-Jen Chen | Myle Ott | Juan Pino | Guillaume Lample | Philipp Koehn | Vishrav Chaudhary | Marc’Aurelio Ranzato
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at

Facebook AI’s WAT19 Myanmar-English Translation Task Submission
Peng-Jen Chen | Jiajun Shen | Matthew Le | Vishrav Chaudhary | Ahmed El-Kishky | Guillaume Wenzek | Myle Ott | Marc’Aurelio Ranzato
Proceedings of the 6th Workshop on Asian Translation

This paper describes Facebook AI’s submission to the WAT 2019 Myanmar-English translation task. Our baseline systems are BPE-based transformer models. We explore methods to leverage monolingual data to improve generalization, including self-training, back-translation and their combination. We further improve results by using noisy channel re-ranking and ensembling. We demonstrate that these techniques can significantly improve not only a system trained with additional monolingual data, but even the baseline system trained exclusively on the provided small parallel dataset. Our system ranks first in both directions according to human evaluation and BLEU, with a gain of over 8 BLEU points above the second best system.
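
Noisy channel re-ranking, mentioned in this abstract, is commonly described as rescoring an n-best list with a combination of a direct model, a channel (reverse) model, and a target-side language model. The sketch below shows one simple form of that combination. The weights, the absence of length normalization, the class and function names, and the log-probabilities are all illustrative assumptions rather than the submission's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    direct_logprob: float   # log P(target | source), from the forward model
    channel_logprob: float  # log P(source | target), from the reverse model
    lm_logprob: float       # log P(target), from a target-side language model

def rerank(candidates, channel_weight=1.0, lm_weight=1.0):
    """Sort candidates by a weighted sum of the three log-probabilities, best first."""
    def score(c):
        return (c.direct_logprob
                + channel_weight * c.channel_logprob
                + lm_weight * c.lm_logprob)
    return sorted(candidates, key=score, reverse=True)

# Hypothetical 2-best list with made-up scores.
nbest = [
    Candidate("hypothesis A", -1.0, -5.0, -4.0),
    Candidate("hypothesis B", -1.5, -2.0, -3.0),
]
best = rerank(nbest)[0]
```

In this toy list the forward model alone prefers hypothesis A, while the combined score prefers hypothesis B because the channel model and language model both rate it higher.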

Facebook FAIR’s WMT19 News Translation Task Submission
Nathan Ng | Kyra Yee | Alexei Baevski | Myle Ott | Michael Auli | Sergey Edunov
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes Facebook FAIR’s submission to the WMT19 shared news translation task. We participate in four language directions, English <-> German and English <-> Russian in both directions. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the FAIRSEQ sequence modeling toolkit. This year we experiment with different bitext data filtering schemes, as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific data, then decode using noisy channel model reranking. Our system improves on our previous system’s performance by 4.5 BLEU points and achieves the best case-sensitive BLEU score for the translation direction English→Russian.

fairseq: A Fast, Extensible Toolkit for Sequence Modeling
Myle Ott | Sergey Edunov | Alexei Baevski | Angela Fan | Sam Gross | Nathan Ng | David Grangier | Michael Auli
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found at


Classical Structured Prediction Losses for Sequence to Sequence Learning
Sergey Edunov | Myle Ott | Michael Auli | David Grangier | Marc’Aurelio Ranzato
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

There has been much recent work on training neural attention models at the sequence-level using either reinforcement learning-style methods or by optimizing the beam. In this paper, we survey a range of classical objective functions that have been widely used to train linear models for structured prediction and apply them to neural sequence to sequence models. Our experiments show that these losses can perform surprisingly well by slightly outperforming beam search optimization in a like for like setup. We also report new state of the art results on both IWSLT’14 German-English translation as well as Gigaword abstractive summarization. On the large WMT’14 English-French task, sequence-level training achieves 41.5 BLEU which is on par with the state of the art.

Understanding Back-Translation at Scale
Sergey Edunov | Myle Ott | Michael Auli | David Grangier
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

An effective method to improve neural machine translation with monolingual data is to augment the parallel training corpus with back-translations of target language sentences. This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences. We find that in all but resource poor settings back-translations obtained via sampling or noised beam outputs are most effective. Our analysis shows that sampling or noisy synthetic data gives a much stronger training signal than data generated by beam or greedy search. We also compare how synthetic data compares to genuine bitext and study various domain effects. Finally, we scale to hundreds of millions of monolingual sentences and achieve a new state of the art of 35 BLEU on the WMT’14 English-German test set.

Phrase-Based & Neural Unsupervised Machine Translation
Guillaume Lample | Myle Ott | Alexis Conneau | Ludovic Denoyer | Marc’Aurelio Ranzato
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT’14 English-French and WMT’16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.

Scaling Neural Machine Translation
Myle Ott | Sergey Edunov | David Grangier | Michael Auli
Proceedings of the Third Conference on Machine Translation: Research Papers

Sequence to sequence learning models still require several days to reach state of the art performance on large benchmark datasets using a single machine. This paper shows that reduced precision and large batch training can speed up training by nearly 5x on a single 8-GPU machine with careful tuning and implementation. On WMT’14 English-German translation, we match the accuracy of Vaswani et al. (2017) in under 5 hours when training on 8 GPUs and we obtain a new state of the art of 29.3 BLEU after training for 85 minutes on 128 GPUs. We further improve these results to 29.8 BLEU by training on the much larger Paracrawl dataset. On the WMT’14 English-French task, we obtain a state-of-the-art BLEU of 43.2 in 8.5 hours on 128 GPUs.


Towards a General Rule for Identifying Deceptive Opinion Spam
Jiwei Li | Myle Ott | Claire Cardie | Eduard Hovy
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Linguistic Models of Deceptive Opinion Spam
Myle Ott
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis


Identifying Manipulated Offerings on Review Portals
Jiwei Li | Myle Ott | Claire Cardie
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Negative Deceptive Opinion Spam
Myle Ott | Claire Cardie | Jeffrey T. Hancock
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


Proceedings of the NAACL HLT 2012 Student Research Workshop
Rivka Levitan | Myle Ott | Roger Levy | Ani Nenkova
Proceedings of the NAACL HLT 2012 Student Research Workshop

In Search of a Gold Standard in Studies of Deception
Stephanie Gokhman | Jeff Hancock | Poornima Prabhu | Myle Ott | Claire Cardie
Proceedings of the Workshop on Computational Approaches to Deception Detection


Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Myle Ott | Yejin Choi | Claire Cardie | Jeffrey T. Hancock
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies