Matteo Negri

Also published as: M. Negri


2020

pdf bib
Gender in Danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus
Luisa Bentivogli | Beatrice Savoldi | Matteo Negri | Mattia A. Di Gangi | Roldano Cattoni | Marco Turchi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Translating from languages without productive grammatical gender like English into gender-marked languages is a well-known difficulty for machines. This difficulty is also due to the fact that the training data on which models are built typically reflect the asymmetries of natural languages, gender bias included. Exclusively fed with textual data, machine translation is intrinsically constrained by the fact that the input sentence does not always contain clues about the gender identity of the referred human entities. But what happens with speech translation, where the input is an audio signal? Can audio provide additional information to reduce gender bias? We present the first thorough investigation of gender bias in speech translation, contributing with: i) the release of a benchmark useful for future studies, and ii) the comparison of different technologies (cascade and end-to-end) on two language directions (English-Italian/French).

pdf bib
MuST-Cinema: a Speech-to-Subtitles corpus
Alina Karakanta | Matteo Negri | Marco Turchi
Proceedings of the 12th Language Resources and Evaluation Conference

Growing needs in localising audiovisual content in multiple languages through subtitles call for the development of automatic solutions for human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, are missing both alignments to the source language audio and important information about subtitle breaks. This poses a significant limitation for developing efficient automatic approaches for subtitling, since the length and form of a subtitle directly depend on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus comprises (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles and propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the constraint of length.

pdf bib
On Target Segmentation for Direct Speech Translation
Mattia A. Di Gangi | Marco Gaido | Matteo Negri | Marco Turchi
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
Machine-oriented NMT Adaptation for Zero-shot NLP tasks: Comparing the Usefulness of Close and Distant Languages
Amirhossein Tebbifakhr | Matteo Negri | Marco Turchi
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Neural Machine Translation (NMT) models are typically trained by considering humans as end-users and maximizing human-oriented objectives. However, in some scenarios, their output is consumed by automatic NLP components rather than by humans. In these scenarios, translations’ quality is measured in terms of their “fitness for purpose” (i.e. maximizing performance of external NLP tools) rather than in terms of standard human fluency/adequacy criteria. Recently, reinforcement learning techniques exploiting the feedback from downstream NLP tools have been proposed for “machine-oriented” NMT adaptation. In this work, we tackle the problem in a multilingual setting where a single NMT model translates from multiple languages for downstream automatic processing in the target language. Knowledge sharing across close and distant languages allows us to apply our machine-oriented approach in the zero-shot setting where no labeled data for the test language is seen at training time. Moreover, we incorporate multilingual BERT in the source side of our NMT system to benefit from the knowledge embedded in this model. Our experiments show coherent performance gains for different language directions over both i) “generic” NMT models (trained for human consumption), and ii) fine-tuned multilingual BERT. This gain for zero-shot language directions (e.g. Spanish–English) is higher when the models are fine-tuned on a closely-related source language (Italian) than a distant one (German).

pdf bib
Findings of the IWSLT 2020 Evaluation Campaign
Ebrahim Ansari | Amittai Axelrod | Nguyen Bach | Ondřej Bojar | Roldano Cattoni | Fahim Dalvi | Nadir Durrani | Marcello Federico | Christian Federmann | Jiatao Gu | Fei Huang | Kevin Knight | Xutai Ma | Ajay Nagesh | Matteo Negri | Jan Niehues | Juan Pino | Elizabeth Salesky | Xing Shi | Sebastian Stüker | Marco Turchi | Alexander Waibel | Changhan Wang
Proceedings of the 17th International Conference on Spoken Language Translation

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured six challenge tracks this year: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of teams participated in at least one of the tracks. This paper introduces each track’s goal, data and evaluation metrics, and reports the results of the received submissions.

pdf bib
End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
Marco Gaido | Mattia A. Di Gangi | Matteo Negri | Marco Turchi
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes FBK’s participation in the IWSLT 2020 offline speech translation (ST) task. The task evaluates systems’ ability to translate English TED talks audio into German texts. The test talks are provided in two versions: one contains the data already segmented with automatic tools and the other is the raw data without any segmentation. Participants can decide whether to work on custom segmentation or not. We used the provided segmentation. Our system is an end-to-end model based on an adaptation of the Transformer for speech data. Its training process is the main focus of this paper and it is based on: i) transfer learning (ASR pretraining and knowledge distillation), ii) data augmentation (SpecAugment, time stretch and synthetic data), iii) combining synthetic and real data marked as different domains, and iv) multi-task learning using the CTC loss. Finally, after the training with word-level knowledge distillation is complete, our ST models are fine-tuned using label-smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test set, which is an excellent result compared to recent papers, and 23.7 BLEU on the same data segmented with VAD, showing the need for researching solutions addressing this specific data condition.

pdf bib
Is 42 the Answer to Everything in Subtitling-oriented Speech Translation?
Alina Karakanta | Matteo Negri | Marco Turchi
Proceedings of the 17th International Conference on Spoken Language Translation

Subtitling is becoming increasingly important for disseminating information, given the enormous amounts of audiovisual content becoming available daily. Although Neural Machine Translation (NMT) can speed up the process of translating audiovisual content, large manual effort is still required for transcribing the source language, and for spotting and segmenting the text into proper subtitles. Creating proper subtitles in terms of timing and segmentation highly depends on information present in the audio (utterance duration, natural pauses). In this work, we explore two methods for applying Speech Translation (ST) to subtitling, a) a direct end-to-end and b) a classical cascade approach. We discuss the benefit of having access to the source language speech for improving the conformity of the generated subtitles to the spatial and temporal subtitling constraints and show that length is not the answer to everything in the case of subtitling-oriented ST.

pdf bib
Breeding Gender-aware Direct Speech Translation Systems
Marco Gaido | Beatrice Savoldi | Luisa Bentivogli | Matteo Negri | Marco Turchi
Proceedings of the 28th International Conference on Computational Linguistics

In automatic speech translation (ST), traditional cascade approaches involving separate transcription and translation steps are giving ground to increasingly competitive and more robust direct solutions. In particular, by translating speech audio data without intermediate transcription, direct ST models are able to leverage and preserve essential information present in the input (e.g. speaker’s vocal characteristics) that is otherwise lost in the cascade framework. Although such ability proved to be useful for gender translation, direct ST is nonetheless affected by gender bias just like its cascade counterpart, as well as machine translation and numerous other natural language processing applications. Moreover, direct ST systems that exclusively rely on vocal biometric features as a gender cue can be unsuitable or even potentially problematic for certain users. Going beyond speech signals, in this paper we compare different approaches to inform direct ST models about the speaker’s gender and test their ability to handle gender translation from English into Italian and French. To this aim, we manually annotated large datasets with speakers’ gender information and used them for experiments reflecting different possible real-world scenarios. Our results show that gender-aware direct ST solutions can significantly outperform strong – but gender-unaware – direct ST models. In particular, the translation of gender-marked words can increase up to 30 points in accuracy while preserving overall translation quality.

pdf bib
The Two Shades of Dubbing in Neural Machine Translation
Alina Karakanta | Supratik Bhattacharya | Shravan Nayak | Timo Baumann | Matteo Negri | Marco Turchi
Proceedings of the 28th International Conference on Computational Linguistics

Dubbing has two shades; synchronisation constraints are applied only when the actor’s mouth is visible on screen, while the translation is unconstrained for off-screen dubbing. Consequently, different synchronisation requirements, and therefore translation strategies, are applied depending on the type of dubbing. In this work, we manually annotate an existing dubbing corpus (Heroes) for this dichotomy. We show that, even though we did not observe distinctive features between on- and off-screen dubbing at the textual level, on-screen dubbing is more difficult for MT (-4 BLEU points). Moreover, synchronisation constraints dramatically decrease translation quality for off-screen dubbing. We conclude that distinguishing between on-screen and off-screen dubbing is necessary for determining successful strategies for dubbing-customised Machine Translation.

pdf bib
Automatic Translation for Multiple NLP tasks: a Multi-task Approach to Machine-oriented NMT Adaptation
Amirhossein Tebbifakhr | Matteo Negri | Marco Turchi
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Although machine translation (MT) traditionally pursues “human-oriented” objectives, humans are not the only possible consumers of MT output. For instance, when automatic translations are used to feed downstream Natural Language Processing (NLP) components in cross-lingual settings, they should ideally pursue “machine-oriented” objectives that maximize the performance of these components. Tebbifakhr et al. (2019) recently proposed a reinforcement learning approach to adapt a generic neural MT (NMT) system by exploiting the reward from a downstream sentiment classifier. But what if the downstream NLP tasks to serve are more than one? How to avoid the costs of adapting and maintaining one dedicated NMT system for each task? We address this problem by proposing a multi-task approach to machine-oriented NMT adaptation, which is capable of serving multiple downstream tasks with a single system. Through experiments with Spanish and Italian data covering three different tasks, we show that our approach can outperform a generic NMT system, and compete with single-task models in most of the settings.

2019

pdf bib
Machine Translation for Machines: the Sentiment Classification Use Case
Amirhossein Tebbifakhr | Luisa Bentivogli | Matteo Negri | Marco Turchi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose a neural machine translation (NMT) approach that, instead of pursuing adequacy and fluency (“human-oriented” quality criteria), aims to generate translations that are best suited as input to a natural language processing component designed for a specific downstream task (a “machine-oriented” criterion). Towards this objective, we present a reinforcement learning technique based on a new candidate sampling strategy, which exploits the results obtained on the downstream task as weak feedback. Experiments in sentiment classification of Twitter data in German and Italian show that feeding an English classifier with “machine-oriented” translations significantly improves its performance. Classification results outperform those obtained with translations produced by general-purpose NMT models as well as by an approach based on reinforcement learning. Moreover, our results on both languages approximate the classification accuracy computed on gold standard English tweets.

pdf bib
Neural Text Simplification in Low-Resource Conditions Using Weak Supervision
Alessio Palmero Aprosio | Sara Tonelli | Marco Turchi | Matteo Negri | Mattia A. Di Gangi
Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation

Neural text simplification has gained increasing attention in the NLP community thanks to recent advancements in deep sequence-to-sequence learning. Most recent efforts with such a data-demanding paradigm have dealt with the English language, for which sizeable training datasets are currently available to deploy competitive models. Similar improvements on less resource-rich languages are conditional either on intensive manual work to create training data, or on the design of effective automatic generation techniques to bypass the data acquisition bottleneck. Inspired by the machine translation field, in which synthetic parallel pairs generated from monolingual data yield significant improvements to neural models, in this paper we exploit large amounts of heterogeneous data to automatically select simple sentences, which are then used to create synthetic simplification pairs. We also evaluate other solutions, such as oversampling and the use of external word embeddings to be fed to the neural simplification system. Our approach is evaluated on Italian and Spanish, for which a few thousand gold sentence pairs are available. The results show that these techniques yield performance improvements over a baseline sequence-to-sequence configuration.

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

pdf bib
Findings of the WMT 2019 Shared Task on Automatic Post-Editing
Rajen Chatterjee | Christian Federmann | Matteo Negri | Marco Turchi
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We present the results from the 5th round of the WMT task on MT Automatic Post-Editing. The task consists in automatically correcting the output of a “black-box” machine translation system by learning from human corrections. Keeping the same general evaluation setting of the previous four rounds, this year we focused on two language pairs (English-German and English-Russian) and on domain-specific data (Information Technology). For both the language directions, MT outputs were produced by neural systems unknown to participants. Seven teams participated in the English-German task, with a total of 18 submitted runs. The evaluation, which was performed on the same test set used for the 2018 round, shows a slight progress in APE technology: 4 teams achieved better results than last year’s winning system, with improvements up to -0.78 TER and +1.23 BLEU points over the baseline. Two teams participated in the English-Russian task submitting 2 runs each. On this new language direction, characterized by a higher quality of the original translations, the task proved to be particularly challenging. None of the submitted runs improved the very high results of the strong system used to produce the initial translations (16.16 TER, 76.20 BLEU).

pdf bib
Effort-Aware Neural Automatic Post-Editing
Amirhossein Tebbifakhr | Matteo Negri | Marco Turchi
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

For this round of the WMT 2019 APE shared task, our submission focuses on addressing the “over-correction” problem in APE. Over-correction occurs when the APE system tends to rephrase an already correct MT output, and the resulting sentence is penalized by a reference-based evaluation against human post-edits. Our intuition is that this problem can be prevented by informing the system about the predicted quality of the MT output or, in other terms, the expected amount of needed corrections. For this purpose, following the common approach in multilingual NMT, we prepend a special token to the beginning of both the source text and the MT output indicating the required amount of post-editing. Following the best submissions to the WMT 2018 APE shared task, our backbone architecture is based on a multi-source Transformer to encode both the MT output and the corresponding source text. We participated both in the English-German and English-Russian subtasks. In the first subtask, our best submission improved the original MT output quality up to +0.98 BLEU and -0.47 TER. In the second subtask, where the higher quality of the MT output increases the risk of over-correction, none of our submitted runs was able to improve the MT output.

pdf bib
Enhancing Transformer for End-to-end Speech-to-Text Translation
Mattia Antonino Di Gangi | Matteo Negri | Roldano Cattoni | Roberto Dessi | Marco Turchi
Proceedings of Machine Translation Summit XVII Volume 1: Research Track

pdf bib
Improving Translations by Combining Fuzzy-Match Repair with Automatic Post-Editing
John Ortega | Felipe Sánchez-Martínez | Marco Turchi | Matteo Negri
Proceedings of Machine Translation Summit XVII Volume 1: Research Track

pdf bib
MuST-C: a Multilingual Speech Translation Corpus
Mattia A. Di Gangi | Roldano Cattoni | Luisa Bentivogli | Matteo Negri | Marco Turchi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Current research on spoken language translation (SLT) has to confront the scarcity of sizeable and publicly available training corpora. This problem hinders the adoption of neural end-to-end approaches, which represent the state of the art in the two parent tasks of SLT: automatic speech recognition and machine translation. To fill this gap, we created MuST-C, a multilingual speech translation corpus whose size and quality will facilitate the training of end-to-end systems for SLT from English into 8 languages. For each target language, MuST-C comprises at least 385 hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual transcriptions and translations. Together with a description of the corpus creation methodology (scalable to add new data and cover new languages), we provide an empirical verification of its quality and SLT results computed with a state-of-the-art approach on each language direction.

2018

pdf bib
ESCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
Matteo Negri | Marco Turchi | Rajen Chatterjee | Nicola Bertoldi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Combining Quality Estimation and Automatic Post-editing to Enhance Machine Translation output
Rajen Chatterjee | Matteo Negri | Marco Turchi | Frédéric Blain | Lucia Specia
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
Proceedings of the Third Conference on Machine Translation: Research Papers
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Lucia Specia | Marco Turchi | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Research Papers

bib
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Lucia Specia | Marco Turchi | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

pdf bib
Findings of the WMT 2018 Shared Task on Automatic Post-Editing
Rajen Chatterjee | Matteo Negri | Raphael Rubino | Marco Turchi
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present the results from the fourth round of the WMT shared task on MT Automatic Post-Editing. The task consists in automatically correcting the output of a “black-box” machine translation system by learning from human corrections. Keeping the same general evaluation setting of the three previous rounds, this year we focused on one language pair (English-German) and on domain-specific data (Information Technology), with MT outputs produced by two different paradigms: phrase-based (PBSMT) and neural (NMT). Five teams submitted respectively 11 runs for the PBSMT subtask and 10 runs for the NMT subtask. In the former subtask, characterized by original translations of lower quality, top results achieved impressive improvements, up to -6.24 TER and +9.53 BLEU points over the baseline “do-nothing” system. The NMT subtask proved to be more challenging due to the higher quality of the original translations and the availability of less training data. In this case, top results show smaller improvements up to -0.38 TER and +0.8 BLEU points.

pdf bib
Multi-source transformer with combined losses for automatic post editing
Amirhossein Tebbifakhr | Ruchit Agrawal | Matteo Negri | Marco Turchi
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Recent approaches to the Automatic Post-editing (APE) of Machine Translation (MT) have shown that best results are obtained by neural multi-source models that correct the raw MT output by also considering information from the corresponding source sentence. To this aim, we present for the first time a neural multi-source APE model based on the Transformer architecture. Moreover, we employ sequence-level loss functions in order to avoid exposure bias during training and to be consistent with the automatic evaluation metrics used for the task. These are the main features of our submissions to the WMT 2018 APE shared task, where we participated both in the PBSMT subtask (i.e. the correction of MT outputs from a phrase-based system) and in the NMT subtask (i.e. the correction of neural outputs). In the first subtask, our system improves over the baseline up to -5.3 TER and +8.23 BLEU points, ranking second out of 11 submitted runs. In the second one, characterized by the higher quality of the initial translations, we report lower but statistically significant gains (up to -0.38 TER and +0.8 BLEU), ranking first out of 10 submissions.

pdf bib
Generating E-Commerce Product Titles and Predicting their Quality
José G. Camargo de Souza | Michael Kozielski | Prashant Mathur | Ernie Chang | Marco Guerini | Matteo Negri | Marco Turchi | Evgeny Matusov
Proceedings of the 11th International Conference on Natural Language Generation

E-commerce platforms present products using titles that summarize product information. These titles cannot be created by hand; therefore, an algorithmic solution is required. The task of automatically generating these titles given noisy user-provided titles is one way to achieve the goal. The setting requires the generation process to be fast and the generated title to be both human-readable and concise. Furthermore, we need to understand if such generated titles are usable. As such, we propose approaches that (i) automatically generate product titles and (ii) predict their quality. Our approach scales to millions of products and both automatic and human evaluations performed on real-world data indicate our approaches are effective and applicable to existing e-commerce scenarios.

2017

pdf bib
Online Automatic Post-editing for MT in a Multi-Domain Translation Environment
Rajen Chatterjee | Gebremedhen Gebremelak | Matteo Negri | Marco Turchi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Automatic post-editing (APE) for machine translation (MT) aims to fix recurrent errors made by the MT decoder by learning from correction examples. In controlled evaluation scenarios, the representativeness of the training set with respect to the test data is a key factor to achieve good performance. Real-life scenarios, however, do not guarantee such favorable learning conditions. Ideally, to be integrated in a real professional translation workflow (e.g. to play a role in a computer-assisted translation framework), APE tools should be flexible enough to cope with continuous streams of diverse data coming from different domains/genres. To cope with this problem, we propose an online APE framework that is: i) robust to data diversity (i.e. capable of learning and applying correction rules in the right contexts) and ii) able to evolve over time (by continuously extending and refining its knowledge). In a comparative evaluation, with English-German test data coming in random order from two different domains, we show the effectiveness of our approach, which outperforms a strong batch system and the state of the art in online APE.

pdf bib
Neural vs. Phrase-Based Machine Translation in a Multi-Domain Scenario
M. Amin Farajian | Marco Turchi | Matteo Negri | Nicola Bertoldi | Marcello Federico
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

State-of-the-art neural machine translation (NMT) systems are generally trained on specific domains by carefully selecting the training sets and applying proper domain adaptation techniques. In this paper we consider the real-world scenario in which the target domain is not predefined, hence the system should be able to translate text from multiple domains. We compare the performance of a generic NMT system and a phrase-based statistical machine translation (PBMT) system by training them on a generic parallel corpus composed of data from different domains. Our results on multi-domain English-French data show that, in these realistic conditions, PBMT outperforms its neural counterpart. This raises the question: is NMT ready for deployment as a generic/multi-purpose MT backbone in real-world settings?

pdf bib
Multi-Domain Neural Machine Translation through Unsupervised Adaptation
M. Amin Farajian | Marco Turchi | Matteo Negri | Marcello Federico
Proceedings of the Second Conference on Machine Translation

pdf bib
Guiding Neural Machine Translation Decoding with External Knowledge
Rajen Chatterjee | Matteo Negri | Marco Turchi | Marcello Federico | Lucia Specia | Frédéric Blain
Proceedings of the Second Conference on Machine Translation

pdf bib
Findings of the 2017 Conference on Machine Translation (WMT17)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Shujian Huang | Matthias Huck | Philipp Koehn | Qun Liu | Varvara Logacheva | Christof Monz | Matteo Negri | Matt Post | Raphael Rubino | Lucia Specia | Marco Turchi
Proceedings of the Second Conference on Machine Translation

pdf bib
Multi-source Neural Automatic Post-Editing: FBK’s participation in the WMT 2017 APE shared task
Rajen Chatterjee | M. Amin Farajian | Matteo Negri | Marco Turchi | Ankit Srivastava | Santanu Pal
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
An Unsupervised Method for Automatic Translation Memory Cleaning
Masoud Jalili Sabet | Matteo Negri | Marco Turchi | Eduard Barbu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
TranscRater: a Tool for Automatic Speech Recognition Quality Estimation
Shahab Jalalvand | Matteo Negri | Marco Turchi | José G. C. de Souza | Daniele Falavigna | Mohammed R. H. Qwaider
Proceedings of ACL-2016 System Demonstrations

pdf bib
TMop: a Tool for Unsupervised Translation Memory Cleaning
Masoud Jalili Sabet | Matteo Negri | Marco Turchi | José G. C. de Souza | Marcello Federico
Proceedings of ACL-2016 System Demonstrations

pdf bib
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Liane Guillou | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Pavel Pecina | Martin Popel | Philipp Koehn | Christof Monz | Matteo Negri | Matt Post | Lucia Specia | Karin Verspoor | Jörg Tiedemann | Marco Turchi
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

bib
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Liane Guillou | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Aurélie Névéol | Mariana Neves | Pavel Pecina | Martin Popel | Philipp Koehn | Christof Monz | Matteo Negri | Matt Post | Lucia Specia | Karin Verspoor | Jörg Tiedemann | Marco Turchi
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Findings of the 2016 Conference on Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Martin Popel | Matt Post | Raphael Rubino | Carolina Scarton | Lucia Specia | Marco Turchi | Karin Verspoor | Marcos Zampieri
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
The FBK Participation in the WMT 2016 Automatic Post-editing Shared Task
Rajen Chatterjee | José G. C. de Souza | Matteo Negri | Marco Turchi
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
FBK HLT-MT at SemEval-2016 Task 1: Cross-lingual Semantic Similarity Measurement Using Quality Estimation Features and Compositional Bilingual Word Embeddings
Duygu Ataman | José G. C. de Souza | Marco Turchi | Matteo Negri
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Findings of the 2015 Workshop on Statistical Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Barry Haddow | Matthias Huck | Chris Hokamp | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Matt Post | Carolina Scarton | Lucia Specia | Marco Turchi
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
The FBK Participation in the WMT15 Automatic Post-editing Shared Task
Rajen Chatterjee | Marco Turchi | Matteo Negri
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Multitask Learning for Adaptive Quality Estimation of Automatically Transcribed Utterances
José G. C. de Souza | Hamed Zamani | Matteo Negri | Marco Turchi | Daniele Falavigna
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Online Multitask Learning for Machine Translation Quality Estimation
José G. C. de Souza | Matteo Negri | Elisa Ricci | Marco Turchi
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Driving ROVER with Segment-based ASR Quality Estimation
Shahab Jalalvand | Matteo Negri | Daniele Falavigna | Marco Turchi
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Exploring the Planet of the APEs: a Comparative Study of State-of-the-art Methods for MT Automatic Post-Editing
Rajen Chatterjee | Marion Weller | Matteo Negri | Marco Turchi
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
MT Quality Estimation for Computer-assisted Translation: Does it Really Help?
Marco Turchi | Matteo Negri | Marcello Federico
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

pdf bib
Adaptive Quality Estimation for Machine Translation
Marco Turchi | Antonios Anastasopoulos | José G. C. de Souza | Matteo Negri
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
FBK-UPV-UEdin participation in the WMT14 Quality Estimation shared-task
José Guilherme Camargo de Souza | Jesús González-Rubio | Christian Buck | Marco Turchi | Matteo Negri
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Machine Translation Quality Estimation Across Domains
José G. C. de Souza | Marco Turchi | Matteo Negri
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Quality Estimation for Automatic Speech Recognition
Matteo Negri | Marco Turchi | José G. C. de Souza | Daniele Falavigna
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
The MateCat Tool
Marcello Federico | Nicola Bertoldi | Mauro Cettolo | Matteo Negri | Marco Turchi | Marco Trombetti | Alessandro Cattelan | Antonio Farina | Domenico Lupinetti | Andrea Martines | Alberto Massidda | Holger Schwenk | Loïc Barrault | Frederic Blain | Philipp Koehn | Christian Buck | Ulrich Germann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

pdf bib
Assessing the Impact of Translation Errors on Machine Translation Quality with Mixed-effects Models
Marcello Federico | Matteo Negri | Luisa Bentivogli | Marco Turchi
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Automatic Annotation of Machine Translation Datasets with Binary Quality Judgements
Marco Turchi | Matteo Negri
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The automatic estimation of machine translation (MT) output quality is an active research area due to its many potential applications (e.g. aiding human translation and post-editing, re-ranking MT hypotheses, MT system combination). Current approaches to the task rely on supervised learning methods for which high-quality labelled data is fundamental. In this framework, quality estimation (QE) has been mainly addressed as a regression problem where models trained on (source, target) sentence pairs annotated with continuous scores (in the [0-1] interval) are used to assign quality scores (in the same interval) to unseen data. Such a definition of the problem assumes that continuous scores are informative and easily interpretable by different users. These assumptions, however, conflict with the subjectivity inherent to human translation and evaluation. On one side, the subjectivity of human judgements adds noise and biases to annotations based on scaled values. This problem reduces the usability of the resulting datasets, especially in application scenarios where a sharp distinction between “good” and “bad” translations is needed. On the other side, continuous scores are not always sufficient to decide whether a translation is actually acceptable or not. To overcome these issues, we present an automatic method for the annotation of (source, target) pairs with binary judgements that reflect an empirical and easily interpretable notion of quality. The method is applied to annotate with binary judgements three QE datasets for different language combinations. The three datasets are combined in a single resource, called BinQE, which can be freely downloaded from http://hlt.fbk.eu/technologies/binqe.

2013

pdf bib
Coping with the Subjectivity of Human Judgements in MT Quality Estimation
Marco Turchi | Matteo Negri | Marcello Federico
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
FBK-UEdin Participation to the WMT13 Quality Estimation Shared Task
José Guilherme Camargo de Souza | Christian Buck | Marco Turchi | Matteo Negri
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks
José G.C. de Souza | Miquel Esplà-Gomis | Marco Turchi | Matteo Negri
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Semeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization
Matteo Negri | Alessandro Marchetti | Yashar Mehdad | Luisa Bentivogli | Danilo Giampiccolo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
ALTN: Word Alignment Features for Cross-lingual Textual Entailment
Marco Turchi | Matteo Negri
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib
Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization
Matteo Negri | Alessandro Marchetti | Yashar Mehdad | Luisa Bentivogli | Danilo Giampiccolo
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
FBK: Machine Translation Evaluation and Word Similarity metrics for Semantic Textual Similarity
José Guilherme Camargo de Souza | Matteo Negri | Yashar Mehdad
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
FBK: Cross-Lingual Textual Entailment Without Translation
Yashar Mehdad | Matteo Negri | José Guilherme C. de Souza
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents
Yashar Mehdad | Matteo Negri | Marcello Federico
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Chinese Whispers: Cooperative Paraphrase Acquisition
Matteo Negri | Yashar Mehdad | Alessandro Marchetti | Danilo Giampiccolo | Luisa Bentivogli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a framework for the acquisition of sentential paraphrases based on crowdsourcing. The proposed method maximizes the lexical divergence between an original sentence s and its valid paraphrases by running a sequence of paraphrasing jobs carried out by a crowd of non-expert workers. Instead of collecting direct paraphrases of s, at each step of the sequence workers manipulate semantically equivalent reformulations produced in the previous round. We applied this method to paraphrase English sentences extracted from Wikipedia. Our results show that, keeping at each round n the most promising paraphrases (i.e. those most lexically dissimilar from the ones acquired at round n-1), the monotonic increase of divergence makes it possible to collect good-quality paraphrases in a cost-effective manner.

pdf bib
Match without a Referee: Evaluating MT Adequacy without Reference Translations
Yashar Mehdad | Matteo Negri | Marcello Federico
Proceedings of the Seventh Workshop on Statistical Machine Translation

2011

pdf bib
Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
Matteo Negri | Luisa Bentivogli | Yashar Mehdad | Danilo Giampiccolo | Alessandro Marchetti
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Is it Worth Submitting this Run? Assess your RTE System with a Good Sparring Partner
Milen Kouylekov | Yashar Mehdad | Matteo Negri
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

pdf bib
Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
Yashar Mehdad | Matteo Negri | Marcello Federico
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Towards Cross-Lingual Textual Entailment
Yashar Mehdad | Matteo Negri | Marcello Federico
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush
Matteo Negri | Yashar Mehdad
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
An Open-Source Package for Recognizing Textual Entailment
Milen Kouylekov | Matteo Negri
Proceedings of the ACL 2010 System Demonstrations

pdf bib
FBK_NK: A WordNet-Based System for Multi-Way Classification of Semantic Relations
Matteo Negri | Milen Kouylekov
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Mining Wikipedia for Large-scale Repositories of Context-Sensitive Entailment Rules
Milen Kouylekov | Yashar Mehdad | Matteo Negri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper focuses on the central role played by lexical information in the task of Recognizing Textual Entailment. In particular, the usefulness of lexical knowledge extracted from several widely used static resources, represented in the form of entailment rules, is compared with a method to extract lexical information from Wikipedia as a dynamic knowledge resource. The proposed acquisition method aims at maximizing two key features of the resulting entailment rules: coverage (i.e. the proportion of rules successfully applied over a dataset of TE pairs), and context sensitivity (i.e. the proportion of rules applied in appropriate contexts). Evaluation results show that Wikipedia can be effectively used as a source of lexical entailment rules, featuring both higher coverage and context sensitivity with respect to other resources.

2009

pdf bib
Question Answering over Structured Data: an Entailment-Based Approach to Question Analysis
Matteo Negri | Milen Kouylekov
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Development and Alignment of a Domain-Specific Ontology for Question Answering
Shiyan Ou | Viktor Pekar | Constantin Orasan | Christian Spurk | Matteo Negri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

With the appearance of Semantic Web technologies, it has become possible to develop novel, sophisticated question answering systems, where ontologies are usually used as the core knowledge component. In the EU-funded QALL-ME project, a domain-specific ontology was developed and applied for question answering in the domain of tourism, with the assistance of two upper ontologies for concept expansion and reasoning. This paper focuses on the development of the QALL-ME ontology in the tourism domain and its alignment with the upper ontologies, WordNet and SUMO. The design of the ontology is presented in the paper, and a semi-automatic alignment procedure is described, along with some alignment results. Furthermore, the aligned ontology was used to semantically annotate original data obtained from tourism web sites and natural language questions. The storage schema of the annotated data and the data access method for retrieving answers from the annotated data are also reported in the paper.

pdf bib
The QALL-ME Benchmark: a Multilingual Resource of Annotated Spoken Requests for Question Answering
Elena Cabrio | Milen Kouylekov | Bernardo Magnini | Matteo Negri | Laura Hasler | Constantin Orasan | David Tomás | Jose Luis Vicedo | Guenter Neumann | Corinna Weber
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the QALL-ME benchmark, a multilingual resource of annotated spoken requests in the tourism domain, freely available for research purposes. The languages currently involved in the project are Italian, English, Spanish and German. It introduces a semantic annotation scheme for spoken information access requests, specifically derived from Question Answering (QA) research. In addition to pragmatic and semantic annotations, we propose three QA-based annotation levels: the Expected Answer Type, the Expected Answer Quantifier and the Question Topical Target of a request, to fully capture the content of a request and extract the sought-after information. The QALL-ME benchmark is developed under the EU-FP6 QALL-ME project, which aims at the realization of a shared and distributed infrastructure for Question Answering (QA) systems on mobile devices (e.g. mobile phones). Questions are formulated by the users in free natural language input, and the system returns the actual sequence of words which constitutes the answer from a collection of information sources (e.g. documents, databases). Within this framework, the benchmark has the twofold purpose of training machine-learning-based applications for QA, and testing their actual performance with a rapid turnaround in a controlled laboratory setting.

pdf bib
Entailment-based Question Answering for Structured Data
Bogdan Sacaleanu | Constantin Orasan | Christian Spurk | Shiyan Ou | Oscar Ferrandez | Milen Kouylekov | Matteo Negri
Coling 2008: Companion volume: Demonstrations

2006

pdf bib
I-CAB: the Italian Content Annotation Bank
B. Magnini | E. Pianta | C. Girardi | M. Negri | L. Romano | M. Speranza | V. Bartalesi Lenzi | R. Sprugnoli
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we present work in progress for the creation of the Italian Content Annotation Bank (I-CAB), a corpus of Italian news annotated with semantic information at different levels. The first level is represented by temporal expressions, the second level is represented by different types of entities (i.e. persons, organizations, locations and geo-political entities), and the third level is represented by relations between entities (e.g. the affiliation relation connecting a person to an organization). So far I-CAB has been manually annotated with temporal expressions, person entities and organization entities. As we intend I-CAB to become a benchmark for various automatic Information Extraction tasks, we followed a policy of reusing already available markup languages. In particular, we adopted the annotation schemes developed for the ACE Entity Detection and Time Expressions Recognition and Normalization tasks. As the ACE guidelines were originally developed for English, part of the effort consisted in adapting them to the specific morpho-syntactic features of Italian. Finally, we have extended them to include a wider range of entities, such as conjunctions.

pdf bib
Evaluating Knowledge-based Approaches to the Multilingual Extension of a Temporal Expression Normalizer
Matteo Negri | Estela Saquete | Patricio Martínez-Barco | Rafael Muñoz
Proceedings of the Workshop on Annotating and Reasoning about Time and Events

pdf bib
Multilingual Extension of a Temporal Expression Normalizer using Annotated Corpora
E. Saquete | P. Martínez-Barco | R. Muñoz | M. Negri | M. Speranza | R. Sprugnoli
Proceedings of the Cross-Language Knowledge Induction Workshop

2004

pdf bib
Multilingual Pattern Libraries for Question Answering: a Case Study for Definition Questions
Hristo Tanev | Milen Kouylekov | Matteo Negri | Bonaventura Coppola | Bernardo Magnini
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

pdf bib
A WordNet-Based Approach to Named Entites Recognition
Bernardo Magnini | Matteo Negri | Roberto Prevete | Hristo Tanev
COLING-02: SEMANET: Building and Using Semantic Networks

pdf bib
Is It the Right Answer? Exploiting Web Redundancy for Answer Validation
Bernardo Magnini | Matteo Negri | Roberto Prevete | Hristo Tanev
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

pdf bib
Towards Automatic Evaluation of Question/Answering Systems
Bernardo Magnini | Matteo Negri | Roberto Prevete | Hristo Tanev
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
