Francisco Guzmán

Also published as: Francisco Guzman


2020

pdf bib
Multi-Hypothesis Machine Translation Evaluation
Marina Fomicheva | Lucia Specia | Francisco Guzmán
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Reliably evaluating Machine Translation (MT) through automated metrics is a long-standing problem. One of the main challenges is the fact that multiple outputs can be equally valid. Attempts to minimise this issue include metrics that relax the matching of MT output and reference strings, and the use of multiple references. The latter has been shown to significantly improve the performance of evaluation metrics. However, collecting multiple references is expensive and in practice a single reference is generally used. In this paper, we propose an alternative approach: instead of modelling linguistic variation in human reference we exploit the MT model uncertainty to generate multiple diverse translations and use these: (i) as surrogates to reference translations; (ii) to obtain a quantification of translation variability to either complement existing metric scores or (iii) replace references altogether. We show that for a number of popular evaluation metrics our variability estimates lead to substantial improvements in correlation with human judgements of quality by up 15%.

pdf bib
Are we Estimating or Guesstimating Translation Quality?
Shuo Sun | Francisco Guzmán | Lucia Specia
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent advances in pre-trained multilingual language models lead to state-of-the-art results on the task of quality estimation (QE) for machine translation. A carefully engineered ensemble of such models won the QE shared task at WMT19. Our in-depth analysis, however, shows that the success of using pre-trained language models for QE is over-estimated due to three issues we observed in current QE datasets: (i) The distributions of quality scores are imbalanced and skewed towards good quality scores; (iii) QE models can perform well on these datasets while looking at only source or translated sentences; (iii) They contain statistical artifacts that correlate well with human-annotated QE labels. Our findings suggest that although QE models might capture fluency of translated sentences and complexity of source sentences, they cannot model adequacy of translations effectively.

pdf bib
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau | Kartikay Khandelwal | Naman Goyal | Vishrav Chaudhary | Guillaume Wenzek | Francisco Guzmán | Edouard Grave | Myle Ott | Luke Zettlemoyer | Veselin Stoyanov
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

pdf bib
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek | Marie-Anne Lachaux | Alexis Conneau | Vishrav Chaudhary | Francisco Guzmán | Armand Joulin | Edouard Grave
Proceedings of the 12th Language Resources and Evaluation Conference

Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

pdf bib
CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs
Ahmed El-Kishky | Vishrav Chaudhary | Francisco Guzmán | Philipp Koehn
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of over 392 million URL pairs from Common Crawl covering documents in 8144 language pairs of which 137 pairs include English. In addition to curating this massive dataset, we introduce baseline methods that leverage cross-lingual representations to identify aligned documents based on their textual content. Finally, we demonstrate the value of this parallel documents dataset through a downstream task of mining parallel sentences and measuring the quality of machine translations from models trained on this mined data. Our objective in releasing this dataset is to foster new research in cross-lingual NLP across a variety of low, medium, and high-resource languages.

pdf bib
Unsupervised Quality Estimation for Neural Machine Translation
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Francisco Guzmán | Mark Fishel | Nikolaos Aletras | Vishrav Chaudhary | Lucia Specia
Transactions of the Association for Computational Linguistics, Volume 8

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it is aimed to inform the user on the quality of the MT output at test time. Existing approaches require large amounts of expert annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Different from most of the current work that treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.

2019

pdf bib
The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English
Francisco Guzmán | Peng-Jen Chen | Myle Ott | Juan Pino | Guillaume Lample | Philipp Koehn | Vishrav Chaudhary | Marc’Aurelio Ranzato
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLORES evaluation datasets for Nepali–English and Sinhala– English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.

pdf bib
Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Philipp Koehn | Francisco Guzmán | Vishrav Chaudhary | Juan Pino
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting 2% and 10% of the highest-quality data to be used to train machine translation systems. This year, the task tackled the low resource condition of Nepali-English and Sinhala-English. Eleven participants from companies, national research labs, and universities participated in this task.

pdf bib
Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings
Vishrav Chaudhary | Yuqing Tang | Francisco Guzmán | Holger Schwenk | Philipp Koehn
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.

2017

pdf bib
Discourse Structure in Machine Translation Evaluation
Shafiq Joty | Francisco Guzmán | Lluís Màrquez | Preslav Nakov
Computational Linguistics, Volume 43, Issue 4 - December 2017

In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics regarding correlation with human judgments both at the segment level and at the system level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference RST tree is positively correlated with translation quality.

2016

pdf bib
It Takes Three to Tango: Triangulation Approach to Answer Ranking in Community Question Answering
Preslav Nakov | Lluís Màrquez | Francisco Guzmán
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Eyes Don’t Lie: Predicting Machine Translation Quality Using Eye Movement
Hassan Sajjad | Francisco Guzmán | Nadir Durrani | Ahmed Abdelali | Houda Bouamor | Irina Temnikova | Stephan Vogel
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
iAppraise: A Manual Machine Translation Evaluation Environment Supporting Eye-tracking
Ahmed Abdelali | Nadir Durrani | Francisco Guzmán
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib
Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings
Francisco Guzmán | Houda Bouamor | Ramy Baly | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Evaluation of machine translation (MT) into morphologically rich languages (MRL) has not been well studied despite posing many challenges. In this paper, we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into an MRL. Specifically we report on Arabic, a language with complex and rich morphology. Our results show that using a neural-network model with different input representations produces results that clearly outperform the state-of-the-art for MT evaluation into Arabic, by almost over 75% increase in correlation with human judgments on pairwise MT evaluation quality task. More importantly, we demonstrate the usefulness of morpho-syntactic representations to model sentence similarity for MT evaluation and address complex linguistic phenomena of Arabic.

pdf bib
Machine Translation Evaluation Meets Community Question Answering
Francisco Guzmán | Lluís Màrquez | Preslav Nakov
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
MTE-NN at SemEval-2016 Task 3: Can Machine Translation Evaluation Help Community Question Answering?
Francisco Guzmán | Preslav Nakov | Lluís Màrquez
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Analyzing Optimization for Statistical Machine Translation: MERT Learns Verbosity, PRO Learns Length
Francisco Guzmán | Preslav Nakov | Stephan Vogel
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf bib
How do Humans Evaluate Machine Translation
Francisco Guzmán | Ahmed Abdelali | Irina Temnikova | Hassan Sajjad | Stephan Vogel
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Pairwise Neural Machine Translation Evaluation
Francisco Guzmán | Shafiq Joty | Lluís Màrquez | Preslav Nakov
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
Using Discourse Structure Improves Machine Translation Evaluation
Francisco Guzmán | Shafiq Joty | Lluís Màrquez | Preslav Nakov
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
DiscoTK: Using Discourse Structure for Machine Translation Evaluation
Shafiq Joty | Francisco Guzmán | Lluís Màrquez | Preslav Nakov
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Learning to Differentiate Better from Worse Translations
Francisco Guzmán | Shafiq Joty | Lluís Màrquez | Alessandro Moschitti | Preslav Nakov | Massimo Nicosia
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
The AMARA Corpus: Building Parallel Language Resources for the Educational Domain
Ahmed Abdelali | Francisco Guzman | Hassan Sajjad | Stephan Vogel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents the AMARA corpus of on-line educational content: a new parallel corpus of educational video subtitles, multilingually aligned for 20 languages, i.e. 20 monolingual corpora and 190 parallel corpora. This corpus includes both resource-rich languages such as English and Arabic, and resource-poor languages such as Hindi and Thai. In this paper, we describe the gathering, validation, and preprocessing of a large collection of parallel, community-generated subtitles. Furthermore, we describe the methodology used to prepare the data for Machine Translation tasks. Additionally, we provide a document-level, jointly aligned development and test sets for 14 language pairs, designed for tuning and testing Machine Translation systems. We provide baseline results for these tasks, and highlight some of the challenges we face when building machine translation systems for educational content.

2013

pdf bib
A Tale about PRO and Monsters
Preslav Nakov | Francisco Guzmán | Stephan Vogel
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Parameter Optimization for Statistical Machine Translation: It Pays to Learn from Hard Examples
Preslav Nakov | Fahad Al Obaidli | Francisco Guzmán | Stephan Vogel
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib
QCRI at WMT12: Experiments in Spanish-English and German-English Machine Translation of News Text
Francisco Guzmán | Preslav Nakov | Ahmed Thabet | Stephan Vogel
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Understanding the Performance of Statistical MT Systems: A Linear Regression Framework
Francisco Guzman | Stephan Vogel
Proceedings of COLING 2012

pdf bib
Optimizing for Sentence-Level BLEU+1 Yields Short Translations
Preslav Nakov | Francisco Guzman | Stephan Vogel
Proceedings of COLING 2012

2010

pdf bib
EMDC: A Semi-supervised Approach for Word Alignment
Qin Gao | Francisco Guzman | Stephan Vogel
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)