Alberto Poncelas


2020

Multiple Segmentations of Thai Sentences for Neural Machine Translation
Alberto Poncelas | Wichaya Pidchamook | Chao-Hong Liu | James Hadley | Andy Way
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Thai is a low-resource language, so data is often not available in sufficient quantities to train a Neural Machine Translation (NMT) model that performs to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English–Thai parallel data by replicating sentence pairs with different word segmentation methods on Thai, as training data for NMT models. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that, by combining these datasets, performance is improved for NMT models trained with a dataset that has been split using a supervised splitting tool.
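
The augmentation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`learn_merges`, `segment`, `augment`) are hypothetical, each unsegmented sentence is simply treated as one character sequence, and the toy strings stand in for Thai text. It only shows how varying the number of Byte Pair Encoding merge operations yields several segmentations of the same sentence, each paired with the same English side.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with the joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sentences, num_merges):
    """Greedily learn up to `num_merges` BPE merges over unsegmented sentences."""
    seqs = [list(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def segment(sentence, merges):
    """Segment one sentence by applying the learned merges in order."""
    seq = list(sentence)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def augment(pairs, merge_counts):
    """Replicate each (source, target) pair once per merge count (assumed scheme)."""
    targets = [t for _, t in pairs]
    out = []
    for n in merge_counts:
        merges = learn_merges(targets, n)
        for src, tgt in pairs:
            out.append((src, " ".join(segment(tgt, merges))))
    return out
```

Calling `augment(pairs, [0, 2, 5])` on a list of two sentence pairs would return six pairs, with the target side segmented at three different granularities.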

Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation
Xabier Soto | Dimitar Shterionov | Alberto Poncelas | Andy Way
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.

The Impact of Indirect Machine Translation on Sentiment Classification
Alberto Poncelas | Pintu Lohar | James Hadley | Andy Way
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Neural Machine Translation for translating into Croatian and Serbian
Maja Popović | Alberto Poncelas | Marija Brkic | Andy Way
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

In this work, we systematically investigate different set-ups for training of neural machine translation (NMT) systems for translation into Croatian and Serbian, two closely related South Slavic languages. We explore English and German as source languages, different sizes and types of training corpora, as well as bilingual and multilingual systems. We also explore translation of English IMDb user movie reviews, a domain/genre where only monolingual data are available. First, our results confirm that multilingual systems with joint target languages perform better. Furthermore, translation performance from English is much better than from German, partly because German is morphologically more complex and partly because the corpus consists mostly of parallel human translations instead of original text and its human translation. The translation from German should be further investigated systematically. For translating user reviews, creating synthetic in-domain parallel data through back- and forward-translation and adding them to a small out-of-domain parallel corpus can yield performance comparable with a system trained on a full out-of-domain corpus. However, it is still not clear what the optimal size of synthetic in-domain data is, especially for forward-translated data where the target language is machine translated. More detailed research including manual evaluation and analysis is needed in this direction.

A Tool for Facilitating OCR Postediting in Historical Documents
Alberto Poncelas | Mohammad Aboomar | Jan Buts | James Hadley | Andy Way
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. If sometimes unreliable, it is also transparent and subject to human intervention.
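
The general procedure described in this abstract (flag word forms outside a vocabulary, propose alternatives, pick one by language-model score) might look roughly like the sketch below. This is an illustration under stated assumptions, not the tool from the paper: candidate generation uses Python's `difflib.get_close_matches`, the language model is a toy add-one-smoothed bigram model, and all function names are hypothetical.

```python
import difflib
from collections import Counter


def train_bigram_counts(tokens):
    """Collect unigram and bigram counts from a list of clean tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def bigram_score(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate of P(word | prev); a stand-in for a real LM."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def postedit(tokens, vocab, unigrams, bigrams):
    """Replace out-of-vocabulary tokens with the best-scoring close match."""
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
            continue
        # Assumed candidate generator: vocabulary entries similar to the token.
        candidates = difflib.get_close_matches(tok, sorted(vocab), n=5, cutoff=0.6)
        if not candidates:
            out.append(tok)  # nothing plausible: leave it for a human editor
            continue
        prev = out[-1] if out else "<s>"
        best = max(
            candidates,
            key=lambda c: bigram_score(prev, c, unigrams, bigrams, len(vocab)),
        )
        out.append(best)
    return out
```

Tokens with no close match are left unchanged, which keeps the process transparent and open to human intervention in the way the abstract describes.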

2019

Transductive Data-Selection Algorithms for Fine-Tuning Neural Machine Translation
Alberto Poncelas | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of The 8th Workshop on Patent and Scientific Literature Translation

Selecting Artificially-Generated Sentences for Fine-Tuning Neural Machine Translation
Alberto Poncelas | Andy Way
Proceedings of the 12th International Conference on Natural Language Generation

Neural Machine Translation (NMT) models tend to achieve the best performance when larger sets of parallel sentences are provided for training. For this reason, augmenting the training set with artificially-generated sentence pairs can boost the performance. Nonetheless, the performance can also be improved with a small number of sentences if they are in the same domain as the test set. Accordingly, we want to explore the use of artificially-generated sentences along with data-selection algorithms to improve NMT models trained solely with authentic data. In this work, we show how artificially-generated sentences can be more beneficial than authentic pairs and what their advantages are when used in combination with data-selection algorithms.

Combining PBSMT and NMT Back-translated Data for Efficient NMT
Alberto Poncelas | Maja Popović | Dimitar Shterionov | Gideon Maillette de Buy Wenniger | Andy Way
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation, which consists of generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular, we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performance when the training set is augmented with back-translated data created by merging different MT approaches.

2018

SMT versus NMT: Preliminary comparisons for Irish
Meghan Dowling | Teresa Lynn | Alberto Poncelas | Andy Way
Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018)

Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods
Catarina Cruz Silva | Chao-Hong Liu | Alberto Poncelas | Andy Way
Proceedings of the Third Conference on Machine Translation: Research Papers

Data selection is a process used in selecting a subset of parallel data for the training of machine translation (MT) systems, so that 1) resources for training might be reduced, 2) trained models could perform better than those trained with the whole corpus, and/or 3) trained models are more tailored to specific domains. It has been shown that for statistical MT (SMT), the use of data selection helps improve the MT performance significantly. In this study, we reviewed three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conducted experiments on Neural Machine Translation (NMT) with the data selected using the three approaches. The results showed that for NMT systems, using data selection also improved the performance, though the gain is not as large as for SMT systems.
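
As an illustration of the first of the three approaches named above, the following sketch ranks candidate sentences by the cosine similarity of their TF-IDF vectors to a small in-domain seed and keeps the top `k`. The smoothed IDF formula, the whitespace tokenisation and the function names are assumptions made for the example; the paper's exact formulation may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenised documents."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    # Smoothed IDF (an assumed variant) so that no weight is zero or undefined.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    return [{w: c * idf[w] for w, c in Counter(d).items()} for d in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select(pool, in_domain, k):
    """Return the k pool sentences most similar to the in-domain seed text."""
    docs = [s.split() for s in pool] + [" ".join(in_domain).split()]
    vecs = tfidf_vectors(docs)
    query = vecs[-1]  # the seed is treated as one query document
    ranked = sorted(
        range(len(pool)), key=lambda i: cosine(vecs[i], query), reverse=True
    )
    return [pool[i] for i in ranked[:k]]
```

In a parallel-data setting the same ranking would be applied to one side of each sentence pair, and the selected indices would be used to pick both sides.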

2017

IJCNLP-2017 Task 4: Customer Feedback Analysis
Chao-Hong Liu | Yasufumi Moriya | Alberto Poncelas | Declan Groves
Proceedings of the IJCNLP 2017, Shared Tasks

This document introduces the IJCNLP 2017 Shared Task on Customer Feedback Analysis. In this shared task we have prepared corpora of customer feedback in four languages, i.e. English, French, Spanish and Japanese. They were annotated with a common meanings categorization, which was improved from an ADAPT-Microsoft pivot study on customer feedback. Twenty teams participated in the shared task and twelve of them submitted prediction results. The results show that performance in predicting the meanings of customer feedback is reasonably good in all four languages. Nine system description papers are archived in the shared task proceedings.

ADAPT Centre Cone Team at IJCNLP-2017 Task 5: A Similarity-Based Logistic Regression Approach to Multi-choice Question Answering in an Examinations Shared Task
Daria Dzendzik | Alberto Poncelas | Carl Vogel | Qun Liu
Proceedings of the IJCNLP 2017, Shared Tasks

We describe the work of a team from the ADAPT Centre in Ireland in addressing automatic answer selection for the Multi-choice Question Answering in Examinations shared task. The system is based on a logistic regression over the string similarities between question, answer, and additional text. We obtain the highest grade out of six systems: 48.7% accuracy on a validation set (vs. a baseline of 29.45%) and 45.6% on a test set.