Marina Fomicheva


2020

Multi-Hypothesis Machine Translation Evaluation
Marina Fomicheva | Lucia Specia | Francisco Guzmán
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Reliably evaluating Machine Translation (MT) through automated metrics is a long-standing problem. One of the main challenges is the fact that multiple outputs can be equally valid. Attempts to minimise this issue include metrics that relax the matching of MT output and reference strings, and the use of multiple references. The latter has been shown to significantly improve the performance of evaluation metrics. However, collecting multiple references is expensive, and in practice a single reference is generally used. In this paper, we propose an alternative approach: instead of modelling linguistic variation in human references, we exploit MT model uncertainty to generate multiple diverse translations and use these (i) as surrogates for reference translations, (ii) to obtain a quantification of translation variability that complements existing metric scores, or (iii) to replace references altogether. We show that for a number of popular evaluation metrics our variability estimates lead to substantial improvements in correlation with human judgements of quality, by up to 15%.
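The core idea lends itself to a short illustration. The Python sketch below shows one plausible way to use sampled MT hypotheses as pseudo-references and to derive a simple variability signal; the `metric` callable, the function names, and the averaging scheme are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: sampled MT hypotheses as pseudo-references plus a
# simple variability score. `metric` stands in for any sentence-level
# evaluation metric (e.g. a BLEU or METEOR wrapper); it is an assumption.
from itertools import combinations
from statistics import mean
from typing import Callable, List

def pseudo_reference_score(mt_output: str,
                           sampled_hyps: List[str],
                           metric: Callable[[str, str], float]) -> float:
    """Score the MT output against sampled hypotheses instead of a human reference."""
    return mean(metric(mt_output, hyp) for hyp in sampled_hyps)

def variability_score(sampled_hyps: List[str],
                      metric: Callable[[str, str], float]) -> float:
    """Average pairwise similarity among sampled hypotheses:
    low values indicate high model uncertainty about the translation."""
    pairs = list(combinations(sampled_hyps, 2))
    return mean(metric(a, b) for a, b in pairs) if pairs else 1.0
```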

Exploring Model Consensus to Generate Translation Paraphrases
Zhenhao Li | Marina Fomicheva | Lucia Specia
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes our submission to the 2020 Duolingo Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). This task focuses on improving the ability of neural MT systems to generate diverse translations. Our submission explores various methods, including N-best translation, Monte Carlo dropout, Diverse Beam Search, Mixture of Experts, Ensembling, and Lexical Substitution. Our main submission is based on the integration of multiple translations from multiple methods using Consensus Voting. Experiments show that the proposed approach achieves a considerable degree of diversity without introducing noisy translations. Our final submission achieves a 0.5510 weighted F1 score on the blind test set for the English-Portuguese track.
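As a rough illustration of the consensus idea, the hypothetical sketch below keeps only candidates proposed by several generation methods (n-best lists, Monte Carlo dropout, diverse beam search, and so on); the function name and the vote threshold are illustrative assumptions, not the submitted system.

```python
# A minimal sketch of consensus voting over candidate translations, assuming
# each generation method returns its own set of candidate strings.
from collections import Counter
from typing import List, Set

def consensus_vote(candidate_sets: List[Set[str]], min_votes: int = 2) -> List[str]:
    """Keep candidates proposed by at least `min_votes` different methods,
    ordered by how many methods agree on them."""
    votes = Counter(c for cand_set in candidate_sets for c in cand_set)
    kept = [c for c, v in votes.items() if v >= min_votes]
    return sorted(kept, key=lambda c: -votes[c])
```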

Unsupervised Quality Estimation for Neural Machine Translation
Marina Fomicheva | Shuo Sun | Lisa Yankovskaya | Frédéric Blain | Francisco Guzmán | Mark Fishel | Nikolaos Aletras | Vishrav Chaudhary | Lucia Specia
Transactions of the Association for Computational Linguistics, Volume 8

Quality Estimation (QE) is an important component in making Machine Translation (MT) useful in real-world applications, as it aims to inform the user of the quality of the MT output at test time. Existing approaches require large amounts of expert-annotated data, computation, and time for training. As an alternative, we devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required. Unlike most current work, which treats the MT system as a black box, we explore useful information that can be extracted from the MT system as a by-product of translation. By utilizing methods for uncertainty quantification, we achieve very good correlation with human judgments of quality, rivaling state-of-the-art supervised QE models. To evaluate our approach we collect the first dataset that enables work on both black-box and glass-box approaches to QE.
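A minimal sketch of the glass-box, uncertainty-based idea appears below, assuming a hypothetical `log_prob_fn(src, hyp)` helper that returns the MT model's length-normalised log-probability of the hypothesis with dropout kept active at inference time (e.g. `model.train()` in PyTorch); the helper and the summary statistics are assumptions for illustration, not the paper's exact feature set.

```python
# Hedged sketch of Monte Carlo dropout QE signals. `log_prob_fn` is an
# assumed hook into the MT model; it must keep dropout active so that
# repeated passes yield different scores.
from statistics import mean, pvariance
from typing import Callable

def mc_dropout_qe(src: str, hyp: str,
                  log_prob_fn: Callable[[str, str], float],
                  n_passes: int = 30) -> dict:
    """Run several stochastic forward passes and summarise them.
    Higher variance suggests lower model confidence in the translation."""
    scores = [log_prob_fn(src, hyp) for _ in range(n_passes)]
    return {"mean_log_prob": mean(scores), "variance": pvariance(scores)}
```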

2019

Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments
Marina Fomicheva | Lucia Specia
Computational Linguistics, Volume 45, Issue 3 - September 2019

Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to improving evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics, focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than traditional statistical MT systems. Finally, we show that the difference in evaluation accuracy across metrics is maintained even when the gold-standard scores are based on different criteria.
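To make the stratified analysis concrete, the sketch below computes metric-human correlation within bands of human quality scores. This bucketed Pearson correlation is only an illustration of analysing metrics by quality level; it is not the local dependency measure proposed in the paper.

```python
# Illustrative meta-evaluation stratified by translation quality, assuming
# paired lists of human scores and metric scores for the same segments.
from typing import List, Tuple
import numpy as np
from scipy.stats import pearsonr

def correlation_by_quality_band(human: List[float], metric: List[float],
                                n_bands: int = 3) -> List[Tuple[float, float]]:
    """Pearson correlation (r, p-value) within equal-sized bands of human scores,
    ordered from lowest to highest quality."""
    order = np.argsort(human)
    results = []
    for band in np.array_split(order, n_bands):
        h = np.asarray(human)[band]
        m = np.asarray(metric)[band]
        r, p = pearsonr(h, m)
        results.append((r, p))
    return results
```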

2018

MAJE Submission to the WMT2018 Shared Task on Parallel Corpus Filtering
Marina Fomicheva | Jesús González-Rubio
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the participation of Webinterpret in the shared task on parallel corpus filtering at the Third Conference on Machine Translation (WMT 2018). We describe the main characteristics of our approach and discuss the results obtained on the data sets published for the shared task.

2016

Using Contextual Information for Machine Translation Evaluation
Marina Fomicheva | Núria Bel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Automatic evaluation of Machine Translation (MT) is typically approached by measuring the similarity between the candidate MT output and a human reference translation. An important limitation of existing evaluation systems is that they are unable to distinguish candidate-reference differences that arise from acceptable linguistic variation from differences induced by MT errors. In this paper we present a new metric, UPF-Cobalt, that addresses this issue by taking into consideration the syntactic contexts of candidate and reference words. The metric applies a penalty when the words are similar but the contexts in which they occur are not equivalent. In this way, machine translations that differ from the human translation but are still essentially correct are distinguished from those that share a high number of words with the reference but alter the meaning of the sentence due to translation errors. The results show that the proposed method is indeed beneficial for automatic MT evaluation. We report experiments based on two different evaluation tasks with various types of manual quality assessment. The metric significantly outperforms state-of-the-art evaluation systems in varying evaluation settings.
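The context-penalty idea can be illustrated with a rough sketch: assuming each sentence comes as (word, head word) pairs from some dependency parser, matching words whose syntactic heads differ are discounted. The representation, the penalty weight, and the scoring below are illustrative assumptions, not the UPF-Cobalt implementation.

```python
# Rough sketch of context-aware word overlap. Sentences are assumed to be
# given as (word, head_word) pairs; both the format and the penalty are
# hypothetical choices for illustration.
from typing import List, Tuple

def context_aware_overlap(cand: List[Tuple[str, str]],
                          ref: List[Tuple[str, str]],
                          penalty: float = 0.5) -> float:
    """Fraction of candidate words found in the reference, discounted when
    the matched words occur under different syntactic heads."""
    ref_heads = {word: head for word, head in ref}
    score = 0.0
    for word, head in cand:
        if word in ref_heads:
            score += 1.0 if ref_heads[word] == head else penalty
    return score / len(cand) if cand else 0.0
```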

Reference Bias in Monolingual Machine Translation Evaluation
Marina Fomicheva | Lucia Specia
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

CobaltF: A Fluent Metric for MT Evaluation
Marina Fomicheva | Núria Bel | Lucia Specia | Iria da Cunha | Anton Malinovskiy
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

USFD at SemEval-2016 Task 1: Putting different State-of-the-Arts into a Box
Ahmet Aker | Frederic Blain | Andres Duque | Marina Fomicheva | Jurica Seva | Kashif Shah | Daniel Beck
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

UPF-Cobalt Submission to WMT15 Metrics Task
Marina Fomicheva | Núria Bel | Iria da Cunha | Anton Malinovskiy
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

Boosting the creation of a treebank
Blanca Arias | Núria Bel | Mercè Lorente | Montserrat Marimón | Alba Milà | Jorge Vivaldi | Muntsa Padró | Marina Fomicheva | Imanol Larrea
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the results of an ongoing experiment in bootstrapping a treebank for Catalan using a dependency parser trained on Spanish sentences. In order to save time and cost, our approach was to exploit the typological similarities between Catalan and Spanish to create a first Catalan data set quickly by: (i) automatically annotating with a de-lexicalized Spanish parser, (ii) manually correcting the parses, and (iii) using the corrected Catalan sentences to train a Catalan parser. The results showed that about 1,000 parsed sentences are required to train a Catalan parser; these were produced in 4 months with 2 annotators.