Marta R. Costa-jussà

Also published as: Marta Ruiz Costa-jussà, Marta R. Costa-Jussa, Marta R. Costa-Jussà


2020

pdf bib
E-Commerce Content and Collaborative-based Recommendation using K-Nearest Neighbors and Enriched Weighted Vectors
Bardia Rafieian | Marta R. Costa-jussà
Proceedings of Workshop on Natural Language Processing in E-Commerce

In this paper, we present two productive and functional recommender methods to improve the accuracy of predicting the right product for the user. One proposal is a survey-based recommender system that uses k-nearest neighbors. It recommends products by asking questions of the user, efficiently applying a binary product vector to the product attributes, and processing the request with minimum error. The second proposal is an enriched collaborative-based recommender system using enriched weighted vectors. Thanks to the style rules, the enriched collaborative-based method recommends outfits with competitive recommendation quality. We evaluated both proposals on a Kaggle fashion dataset along with iMaterialist, and the results show equivalent performance on binary gender and product attributes.

pdf bib
Combining Subword Representations into Word-level Representations in the Transformer Architecture
Noe Casas | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In Neural Machine Translation, using word-level tokens leads to degradation in translation quality. The dominant approaches use subword-level tokens, but this increases the length of the sequences and makes it difficult to profit from word-level information such as POS tags or semantic dependencies. We propose a modification to the Transformer model to combine subword-level representations into word-level ones in the first layers of the encoder, reducing the effective length of the sequences in the following layers and providing a natural point to incorporate extra word-level information. Our experiments show that this approach maintains the translation quality with respect to the normal Transformer model when no extra word-level information is injected and that it is superior to the currently dominant method for incorporating word-level source language information to models based on subword-level vocabularies.

pdf bib
Enhancing Word Embeddings with Knowledge Extracted from Lexical Resources
Magdalena Biesialska | Bardia Rafieian | Marta R. Costa-jussà
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In this work, we present an effective method for semantic specialization of word vector representations. To this end, we use traditional word embeddings and apply specialization methods to better capture semantic relations between words. In our approach, we leverage external knowledge from rich lexical resources such as BabelNet. We also show that our proposed post-specialization method, based on an adversarial neural network with the Wasserstein distance, yields improvements over state-of-the-art methods on two tasks: word similarity and dialog state tracking.

pdf bib
Syntax-driven Iterative Expansion Language Models for Controllable Text Generation
Noe Casas | José A. R. Fonollosa | Marta R. Costa-jussà
Proceedings of the Fourth Workshop on Structured Prediction for NLP

The dominant language modeling paradigm handles text as a sequence of discrete tokens. While that approach can capture the latent structure of the text, it is inherently constrained to sequential dynamics for text generation. We propose a new paradigm for introducing a syntactic inductive bias into neural text generation, where the dependency parse tree is used to drive the Transformer model to generate sentences iteratively. Our experiments show that this paradigm is effective at text generation, with quality between that of LSTMs and Transformers and comparable diversity, while requiring fewer than half their decoding steps. Its generation process allows direct control over the syntactic constructions of the generated text, enabling the induction of stylistic variations.

pdf bib
Abusive language in Spanish children and young teenager’s conversations: data preparation and short text classification with contextual word embeddings
Marta R. Costa-jussà | Esther González | Asuncion Moreno | Eudald Cumalat
Proceedings of the 12th Language Resources and Evaluation Conference

Abusive texts are reaching the interest of the scientific and social community. How to detect them automatically is one question that is gaining interest in the natural language processing community. The main contribution of this paper is to evaluate the quality of the recently developed "Spanish Database for cyberbullying prevention" for the purpose of training classifiers to detect abusive short texts. We compare classical machine learning techniques to the use of a more advanced model, contextual word embeddings, in the particular case of classifying abusive short texts for the Spanish language. As contextual word embeddings, we use Bidirectional Encoder Representations from Transformers (BERT), proposed at the end of 2018. We show that BERT mostly outperforms classical techniques. Far beyond the experimental impact of our research, this project aims at planting the seeds for an innovative technological tool with high potential social impact, aiming at being part of the initiatives in artificial intelligence for social good.

pdf bib
GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies
Marta R. Costa-jussà | Pau Li Lin | Cristina España-Bonet
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract corpora balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the way toward standardized procedures for producing gender-balanced datasets.

pdf bib
Automatic Spanish Translation of SQuAD Dataset for Multi-lingual Question Answering
Casimiro Pio Carrino | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 12th Language Resources and Evaluation Conference

Recently, multilingual question answering has become a crucial research topic, and it is receiving increased interest in the NLP community. However, the unavailability of large-scale datasets makes it challenging to train multilingual QA systems with performance comparable to the English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 to Spanish. We then used this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluated our QA models on the recently proposed MLQA and XQuAD benchmarks for cross-lingual extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving new state-of-the-art values of 68.1 F1 on the Spanish MLQA corpus and 77.6 F1 on the Spanish XQuAD corpus. To the best of our knowledge, the resulting synthetically generated SQuAD-es v1.1 corpus, which retains almost 100% of the data contained in the original English version, is the first large-scale QA training resource for Spanish.

pdf bib
Multilingual and Interlingual Semantic Representations for Natural Language Processing: A Brief Introduction
Marta R. Costa-jussà | Cristina España-Bonet | Pascale Fung | Noah A. Smith
Computational Linguistics, Volume 46, Issue 2 - June 2020

We introduce the Computational Linguistics special issue on Multilingual and Interlingual Semantic Representations for Natural Language Processing. We situate the special issue’s five articles in the context of our fast-changing field, explaining our motivation for this project. We offer a brief summary of the work in the issue, which includes developments on lexical and sentential semantic representations, from symbolic and neural perspectives.

pdf bib
Continual Lifelong Learning in Natural Language Processing: A Survey
Magdalena Biesialska | Katarzyna Biesialska | Marta R. Costa-jussà
Proceedings of the 28th International Conference on Computational Linguistics

Continual learning (CL) aims to enable information systems to learn from a continuous data stream across time. However, it is difficult for existing deep learning architectures to learn a new task without largely forgetting previously acquired knowledge. Furthermore, CL is particularly challenging for language learning, as natural language is ambiguous: it is discrete, compositional, and its meaning is context-dependent. In this work, we look at the problem of CL through the lens of various NLP tasks. Our survey discusses major challenges in CL and current methods applied in neural network models. We also provide a critical review of the existing CL evaluation methods and datasets in NLP. Finally, we present our outlook on future research directions.

bib
Towards Mitigating Gender Bias in a decoder-based Neural Machine Translation model by Adding Contextual Information
Christine Basta | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Fourth Widening Natural Language Processing Workshop

Gender bias negatively impacts many natural language processing applications, including machine translation (MT). The motivation behind this work is to study whether recently proposed MT techniques significantly contribute to attenuating biases in document-level and gender-balanced data. For the study, we consider approaches that add the previous sentence and the speaker information, implemented in a decoder-based neural MT system. We show improvements both in translation quality (+1 BLEU point) and in gender bias mitigation on WinoMT (+5% accuracy).

2019

pdf bib
Multilingual, Multi-scale and Multi-layer Visualization of Intermediate Representations
Carlos Escolano | Marta R. Costa-jussà | Elora Lacroux | Pere-Pau Vázquez
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

The main alternatives nowadays for dealing with sequences are Recurrent Neural Network (RNN) architectures and the Transformer. In this context, both RNNs and the Transformer have been used as encoder-decoder architectures with multiple layers in each module. Far beyond this, these architectures are the basis for the contextual word embeddings which are revolutionizing most natural language downstream applications. However, intermediate representations in either the RNN or Transformer architectures can be difficult to interpret. To make these layer representations more accessible and meaningful, we introduce a web-based tool that visualizes them at both the sentence and token level. We present three use cases. The first analyses gender issues in contextual word embeddings. The second and third show multilingual intermediate representations for sentences and tokens, and the evolution of these intermediate representations along the multiple layers of the decoder, in the context of multilingual machine translation.

pdf bib
From Bilingual to Multilingual Neural Machine Translation by Incremental Training
Carlos Escolano | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Multilingual Neural Machine Translation approaches are based on the use of task-specific models, and adding one more language can only be done by retraining the whole system. In this work, we propose a new training schedule, based on joint training and language-independent encoder/decoder modules, that allows the system to scale to more languages without modifying the previous components and that allows zero-shot translation. This work in progress shows results close to the state of the art on the WMT task.

pdf bib
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Marta R. Costa-jussà | Enrique Alfonseca
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the First Workshop on Gender Bias in Natural Language Processing
Marta R. Costa-jussà | Christian Hardmeier | Will Radford | Kellie Webster
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

pdf bib
Gendered Ambiguous Pronoun (GAP) Shared Task at the Gender Bias in NLP Workshop 2019
Kellie Webster | Marta R. Costa-jussà | Christian Hardmeier | Will Radford
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

The 1st ACL workshop on Gender Bias in Natural Language Processing included a shared task on gendered ambiguous pronoun (GAP) resolution. This task was based on the coreference challenge defined in Webster et al. (2018), designed to benchmark the ability of systems to resolve pronouns in real-world contexts in a gender-fair way. 263 teams competed via a Kaggle competition, with the winning system achieving logloss of 0.13667 and near gender parity. We review the approaches of eleven systems with accepted description papers, noting their effective use of BERT (Devlin et al., 2018), both via fine-tuning and for feature extraction, as well as ensembling.

pdf bib
Evaluating the Underlying Gender Bias in Contextualized Word Embeddings
Christine Basta | Marta R. Costa-jussà | Noe Casas
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Gender bias highly impacts natural language processing applications. Word embeddings have clearly been proven both to retain and amplify gender biases that are present in current data sources. Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations dependent on the sentence they appear in. In this paper, we study the impact of this conceptual change in word embedding computation in relation to gender bias. Our analysis includes different measures previously applied in the literature to standard word embeddings. Our findings suggest that contextualized word embeddings are less biased than standard ones, even when the latter are debiased.

pdf bib
BERT Masked Language Modeling for Co-reference Resolution
Felipe Alfaro | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

This paper describes the TALP-UPC participation in the Gendered Pronoun Resolution shared task of the 1st ACL Workshop on Gender Bias in Natural Language Processing. We implemented two models for masked language modeling using pre-trained BERT, adjusted to work for a classification problem. The proposed solutions are based on the word probabilities of the original BERT model, but use common English names to replace the original test names.

pdf bib
Equalizing Gender Bias in Neural Machine Translation with Word Embeddings Techniques
Joel Escudé Font | Marta R. Costa-jussà
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Neural machine translation has significantly pushed forward the quality of the field. However, big issues remain with the output translations, and one of them is fairness. Neural models are trained on large text corpora which contain biases and stereotypes. As a consequence, models inherit these social biases. Recent methods have shown results in reducing gender bias in other natural language processing tools such as word embeddings. We take advantage of the fact that word embeddings are used in neural machine translation to propose a method to equalize gender biases in neural machine translation using these representations. Specifically, we propose, experiment with, and analyze the integration of two debiasing techniques over GloVe embeddings in the Transformer translation architecture. We evaluate our proposed system on the WMT English-Spanish benchmark task, showing gains of up to one BLEU point. As for the gender bias evaluation, we generate a test set of occupations and we show that our proposed system learns to equalize existing biases from the baseline system.

pdf bib
Findings of the 2019 Conference on Machine Translation (WMT19)
Loïc Barrault | Ondřej Bojar | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Philipp Koehn | Shervin Malmasi | Christof Monz | Mathias Müller | Santanu Pal | Matt Post | Marcos Zampieri
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

pdf bib
The TALP-UPC Machine Translation Systems for WMT19 News Translation Task: Pivoting Techniques for Low Resource MT
Noe Casas | José A. R. Fonollosa | Carlos Escolano | Christine Basta | Marta R. Costa-jussà
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this article, we describe the TALP-UPC research group participation in the WMT19 news translation shared task for Kazakh-English. Given the low amount of parallel training data, we resort to using Russian as a pivot language, training subword-based statistical translation systems for Russian-Kazakh and Russian-English that were then used to create two synthetic pseudo-parallel corpora for Kazakh-English and English-Kazakh, respectively. Finally, a self-attention model based on the decoder part of the Transformer architecture was trained on the two pseudo-parallel corpora.

pdf bib
Terminology-Aware Segmentation and Domain Feature for the WMT19 Biomedical Translation Task
Casimiro Pio Carrino | Bardia Rafieian | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

In this work, we describe the TALP-UPC systems submitted for the WMT19 Biomedical Translation Task. Our proposed strategy is NMT model-independent and relies on only one ingredient, a biomedical terminology list. We first extracted such a terminology list by labelling biomedical words in our training dataset using the BabelNet API. Then, we designed a data preparation strategy to insert the term information at the token level. Finally, we trained the Transformer model with this term-informed data. Our best submitted system ranked 2nd and 3rd for the Spanish-English and English-Spanish translation directions, respectively.

pdf bib
The TALP-UPC System for the WMT Similar Language Task: Statistical vs Neural Machine Translation
Magdalena Biesialska | Lluis Guardia | Marta R. Costa-jussà
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Although the problem of similar language translation has been an area of research interest for many years, it is still far from being solved. In this paper, we study the performance of two popular approaches: statistical and neural. We conclude that both methods yield similar results; however, the performance varies depending on the language pair. While the statistical approach outperforms the neural one by a difference of 6 BLEU points for the Spanish-Portuguese language pair, the proposed neural model surpasses the statistical one by a difference of 2 BLEU points for Czech-Polish. In the former case, the language similarity (based on perplexity) is much higher than in the latter case. Additionally, we report negative results for the system combination with back-translation. Our TALP-UPC system submission won 1st place for Czech->Polish and 2nd place for Spanish->Portuguese in the official evaluation of the 1st WMT Similar Language Translation task.

2018

pdf bib
A Neural Approach to Language Variety Translation
Marta R. Costa-jussà | Marcos Zampieri | Santanu Pal
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

In this paper we present the first neural-based machine translation system trained to translate between standard national varieties of the same language. We take the pair Brazilian - European Portuguese as an example and compare the performance of this method to a phrase-based statistical machine translation system. We report a performance improvement of 0.9 BLEU points in translating from European to Brazilian Portuguese and 0.2 BLEU points when translating in the opposite direction. We also carried out a human evaluation experiment with native speakers of Brazilian Portuguese which indicates that humans prefer the output produced by the neural-based system in comparison to the statistical system.

pdf bib
The TALP-UPC Machine Translation Systems for WMT18 News Shared Translation Task
Noe Casas | Carlos Escolano | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

In this article we describe the TALP-UPC research group participation in the WMT18 news shared translation task for Finnish-English and Estonian-English within the multilingual subtrack. All of our primary submissions implement an attention-based Neural Machine Translation architecture. Given that Finnish and Estonian belong to the same language family and are similar, we use the combination of the datasets of both language pairs as training data to palliate the data scarcity of each individual pair. We also report the translation quality of systems trained on individual language pair data to serve as a baseline and comparison reference.

pdf bib
Neural Machine Translation with the Transformer and Multi-Source Romance Languages for the Biomedical WMT 2018 task
Brian Tubay | Marta R. Costa-jussà
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The Transformer architecture has become the state-of-the-art in Machine Translation. This model, which relies on attention-based mechanisms, has outperformed previous neural machine translation architectures in several tasks. In this system description paper, we report details of training neural machine translation with multi-source Romance languages with the Transformer model and in the evaluation frame of the biomedical WMT 2018 task. Using multi-source languages from the same family allows improvements of over 6 BLEU points.

2017

pdf bib
Why Catalan-Spanish Neural Machine Translation? Analysis, comparison and combination with standard Rule and Phrase-based technologies
Marta R. Costa-jussà
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

Catalan and Spanish are two related languages, given that both derive from Latin. They share similarities at several linguistic levels, including morphology, syntax and semantics. This makes them particularly interesting for the MT task. Given the recent appearance and popularity of neural MT, this paper analyzes the performance of this new approach compared to the well-established rule-based and phrase-based MT systems. Experiments are reported on a large database of 180 million words. Results, in terms of standard automatic measures, show that neural MT clearly outperforms the rule-based and phrase-based MT systems on the in-domain test set, but is worse on the out-of-domain test set. A naive system combination works especially well for the latter. In-domain manual analysis shows that neural MT tends to improve both adequacy and fluency, for example, by being able to generate more natural translations instead of literal ones, choosing the adequate target word when the source word has several translations, and improving gender agreement. However, out-of-domain manual analysis shows how neural MT is more affected by unknown words or contexts.

pdf bib
Byte-based Neural Machine Translation
Marta R. Costa-jussà | Carlos Escolano | José A. R. Fonollosa
Proceedings of the First Workshop on Subword and Character Level Models in NLP

This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multilingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems on several language pairs and we see that test performance is similar for most language pairs, while training time is slightly reduced in the case of byte-based neural machine translation.

pdf bib
The TALP-UPC Neural Machine Translation System for German/Finnish-English Using the Inverse Direction Model in Rescoring
Carlos Escolano | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Second Conference on Machine Translation

pdf bib
Character-level Intra Attention Network for Natural Language Inference
Han Yang | Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

Natural language inference (NLI) is a central problem in language understanding. End-to-end artificial neural networks have recently reached state-of-the-art performance in the NLI field. In this paper, we propose the Character-level Intra Attention Network (CIAN) for the NLI task. In our model, we use a character-level convolutional network to replace the standard word embedding layer, and we use intra attention to capture the intra-sentence semantics. The proposed CIAN model provides improved results on the newly published MNLI corpus.

2016

pdf bib
Integration of machine translation paradigms
Marta R. Costa-jussà
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf bib
Character-based Neural Machine Translation
Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
The TALP-UPC Spanish–English WMT Biomedical Task: Bilingual Embeddings and Char-based Neural Language Model Rescoring in a Phrase-based System
Marta R. Costa-jussà | Cristina España-Bonet | Pranava Madhyastha | Carlos Escolano | José A. R. Fonollosa
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
WMT 2016 Multimodal Translation System Description based on Bidirectional Recurrent Neural Networks with Double-Embeddings
Sergio Rodríguez Guasch | Marta R. Costa-jussà
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Moses-based official baseline for NEWS 2016
Marta R. Costa-jussà
Proceedings of the Sixth Named Entity Workshop

pdf bib
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)
Patrik Lambert | Bogdan Babych | Kurt Eberle | Rafael E. Banchs | Reinhard Rapp | Marta R. Costa-jussà
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)

2015

pdf bib
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)
Bogdan Babych | Kurt Eberle | Patrik Lambert | Reinhard Rapp | Rafael E. Banchs | Marta R. Costa-jussà
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

pdf bib
Ongoing Study for Enhancing Chinese-Spanish Translation with Morphology Strategies
Marta R. Costa-jussà
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

2014

pdf bib
CHISPA on the GO: A mobile Chinese-Spanish translation service for travellers in trouble
Jordi Centelles | Marta R. Costa-jussà | Rafael E. Banchs
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)
Rafael E. Banchs | Marta R. Costa-jussà | Reinhard Rapp | Patrik Lambert | Kurt Eberle | Bogdan Babych
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

pdf bib
Chinese-to-Spanish rule-based machine translation system
Jordi Centelles | Marta R. Costa-jussà
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

pdf bib
English-to-Hindi system description for WMT 2014: Deep Source-Context Features for Moses
Marta R. Costa-jussà | Parth Gupta | Paolo Rosso | Rafael E. Banchs
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

pdf bib
The TALP-UPC Phrase-Based Translation Systems for WMT13: System Combination with Morphology Generation, Domain Adaptation and Corpus Filtering
Lluís Formiga | Marta R. Costa-jussà | José B. Mariño | José A. R. Fonollosa | Alberto Barrón-Cedeño | Lluís Màrquez
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Second Workshop on Hybrid Approaches to Translation
Marta Ruiz Costa-jussà | Reinhard Rapp | Patrik Lambert | Kurt Eberle | Rafael E. Banchs | Bogdan Babych
Proceedings of the Second Workshop on Hybrid Approaches to Translation

pdf bib
Workshop on Hybrid Approaches to Translation: Overview and Developments
Marta R. Costa-jussà | Rafael Banchs | Reinhard Rapp | Patrik Lambert | Kurt Eberle | Bogdan Babych
Proceedings of the Second Workshop on Hybrid Approaches to Translation

pdf bib
Morphological, Syntactical and Semantic Knowledge in Statistical Machine Translation
Marta Ruiz Costa-jussà | Chris Quirk
NAACL HLT 2013 Tutorial Abstracts

2012

pdf bib
A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation
Eleftherios Avramidis | Marta R. Costa-jussà | Christian Federmann | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual “component” translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data.

pdf bib
Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Maite Melero | Marta R. Costa-Jussà | Judith Domingo | Montse Marquina | Martí Quixal
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires the revisiting of the initial steps of NLP processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.

pdf bib
BUCEADOR, a multi-language search engine for digital libraries
Jordi Adell | Antonio Bonafonte | Antonio Cardenal | Marta R. Costa-Jussà | José A. R. Fonollosa | Asunción Moreno | Eva Navas | Eduardo R. Banga
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a web-based multimedia search engine built within the Buceador (www.buceador.org) research project. A proof-of-concept tool has been implemented which is able to retrieve information from a digital library made of multimedia documents in the four official languages of Spain (Spanish, Basque, Catalan and Galician). The retrieved documents are presented in the user's language after translation and dubbing (the four previous languages + English). The paper presents the tool's functionality, the architecture and the digital library, and provides some information about the technology involved in the fields of automatic speech recognition, statistical machine translation, text-to-speech synthesis and information retrieval. Each technology has been adapted to the purposes of the presented tool as well as to interact with the rest of the technologies involved.

pdf bib
The ML4HMT Workshop on Optimising the Division of Labour in Hybrid Machine Translation
Christian Federmann | Eleftherios Avramidis | Marta R. Costa-jussà | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the “Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation” (ML4HMT) which aims to foster research on improved system combination approaches for machine translation (MT). Participants of the challenge are requested to build hybrid translations by combining the output of several MT systems of different types. We first describe the ML4HMT corpus used in the shared task, then explain the XLIFF-based annotation format we have designed for it, and briefly summarize the participating systems. Using both automated metrics scores and extensive manual evaluation, we discuss the individual performance of the various systems. An interesting result from the shared task is the fact that we were able to observe different systems winning according to the automated metrics scores when compared to the results from the manual evaluation. We conclude by summarising the first edition of the challenge and by giving an outlook to future work.

pdf bib
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Marta R. Costa-jussà | Patrik Lambert | Rafael E. Banchs | Reinhard Rapp | Bogdan Babych
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

pdf bib
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT
Josef van Genabith | Toni Badia | Christian Federmann | Maite Melero | Marta R. Costa-jussà | Tsuyoshi Okita
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

pdf bib
Results from the ML4HMT-12 Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation
Christian Federmann | Tsuyoshi Okita | Maite Melero | Marta R. Costa-Jussa | Toni Badia | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

2011

pdf bib
A Semantic Feature for Statistical Machine Translation
Rafael E. Banchs | Marta R. Costa-jussà
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
The BM-I2R Haitian-Créole-to-English translation system description for the WMT 2011 evaluation campaign
Marta R. Costa-jussà | Rafael E. Banchs
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Enhancing scarce-resource language translation through pivot combinations
Marta R. Costa-jussà | Carlos Henríquez | Rafael E. Banchs
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
Linguistic-based Evaluation Criteria to identify Statistical Machine Translation Errors
Mireia Farrús | Marta R. Costa-jussà | José B. Mariño | José A. R. Fonollosa
Proceedings of the 14th Annual conference of the European Association for Machine Translation

pdf bib
Opinion Mining of Spanish Customer Comments with Non-Expert Annotations on Mechanical Turk
Bart Mellebeek | Francesc Benavent | Jens Grivolla | Joan Codina | Marta R. Costa-jussà | Rafael Banchs
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Using Collocation Segmentation to Augment the Phrase Table
Carlos A. Henríquez Q. | Marta Ruiz Costa-jussà | Vidas Daudaravicius | Rafael E. Banchs | José B. Mariño
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib
Using Linear Interpolation and Weighted Reordering Hypotheses in the Moses System
Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper proposes to introduce a novel reordering model in the open-source Moses toolkit. The main idea is to provide weighted reordering hypotheses to the SMT decoder. These hypotheses are built using a first-step Ngram-based SMT translation from a source language into a third representation that is called the reordered source language. Each hypothesis has its own weight provided by the Ngram-based decoder. This proposed reordering technique offers a better and more efficient translation when compared to both distance-based and lexicalized reordering. In addition to this reordering approach, this paper describes a domain adaptation technique based on a linear combination of a specific in-domain and an extra out-of-domain translation model. Results for both approaches are reported on the Arabic-to-English 2008 IWSLT task. When implementing the weighted reordering hypotheses and the domain adaptation technique in the final translation system, translation results reach improvements of up to 2.5 BLEU compared to a standard state-of-the-art Moses baseline system.

pdf bib
Automatic and Human Evaluation Study of a Rule-based and a Statistical Catalan-Spanish Machine Translation Systems
Marta R. Costa-jussà | Mireia Farrús | José B. Mariño | José A. R. Fonollosa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Machine translation systems can be classified into rule-based and corpus-based approaches, in terms of their core technology. Since both paradigms have been widely used during the last years, one of the aims in the research community is to know how these systems differ in terms of translation quality. To this end, this paper reports a study and comparison of a rule-based and a corpus-based (particularly, statistical) Catalan-Spanish machine translation system, both of them freely available on the web. The translation quality analysis is performed in two different domains: journalistic and medical. The systems are evaluated using standard automatic measures, as well as by native human evaluators. Automatic results show that the statistical system performs better than the rule-based system. Human judgements show that in the Spanish-to-Catalan direction the statistical system also performs better than the rule-based system, while in the Catalan-to-Spanish direction it is the other way round. Although the statistical system obtains the best automatic scores, its errors tend to be more penalized by human judgements than the errors of the rule-based system. This can be explained because statistical errors are usually unexpected and they do not follow any pattern.

2009

pdf bib
The TALP-UPC Phrase-Based Translation System for EACL-WMT 2009
José A. R. Fonollosa | Maxim Khalilov | Marta R. Costa-jussà | José B. Mariño | Carlos A. Henríquez Q. | Adolfo Hernández H. | Rafael E. Banchs
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Improving a Catalan-Spanish Statistical Translation System using Morphosyntactic Knowledge
Mireia Farrús | Marta R. Costa-jussà | Marc Poch | Adolfo Hernández | José B. Mariño
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2008

pdf bib
Using Reordering in Statistical Machine Translation based on Alignment Block Classification
Marta R. Costa-jussà | José A. R. Fonollosa | Enric Monte
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Statistical Machine Translation (SMT) is based on alignment models which learn from bilingual corpora the word correspondences between source and target language. These models are assumed to be capable of learning reorderings of sequences of words. However, the difference in word order between two languages is one of the most important sources of errors in SMT. This paper proposes a Recursive Alignment Block Classification algorithm (RABCA) that can take advantage of inductive learning in order to solve reordering problems. This algorithm should be able to cope with swapping examples seen during training; it should infer properties that might permit reordering pairs of blocks (sequences of words) which did not appear during training; and finally it should be robust with respect to training errors and ambiguities. Experiments are reported on the EuroParl task, and RABCA is tested using two state-of-the-art SMT systems: a phrase-based one and an Ngram-based one. In both cases, RABCA improves results.

pdf bib
The TALP-UPC Ngram-Based Statistical Machine Translation System for ACL-WMT 2008
Maxim Khalilov | Adolfo Hernández H. | Marta R. Costa-jussà | Josep M. Crego | Carlos A. Henríquez Q. | Patrik Lambert | José A. R. Fonollosa | José B. Mariño | Rafael E. Banchs
Proceedings of the Third Workshop on Statistical Machine Translation

2007

pdf bib
Analysis and System Combination of Phrase- and N-Gram-Based Statistical Machine Translation Systems
Marta R. Costa-jussà | Josep M. Crego | David Vilar | José A. R. Fonollosa | José B. Mariño | Hermann Ney
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Ngram-Based Statistical Machine Translation Enhanced with Multiple Weighted Reordering Hypotheses
Marta R. Costa-jussà | Josep M. Crego | Patrik Lambert | Maxim Khalilov | José A. R. Fonollosa | José B. Mariño | Rafael E. Banchs
Proceedings of the Second Workshop on Statistical Machine Translation

pdf bib
Analysis of Statistical and Morphological Classes to Generate Weighted Reordering Hypotheses on a Statistical Machine Translation System
Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the Second Workshop on Statistical Machine Translation

pdf bib
Smooth Bilingual N-Gram Translation
Holger Schwenk | Marta R. Costa-jussà | Jose A. R. Fonollosa
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib
N-gram-based Machine Translation
José Mariño | Rafael E. Banchs | Josep M. Crego | Adrià de Gispert | Patrik Lambert | José A. R. Fonollosa | Marta R. Costa-jussà
Computational Linguistics, Volume 32, Number 4, December 2006

pdf bib
Statistical Machine Reordering
Marta R. Costa-jussà | José A. R. Fonollosa
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
TALP Phrase-based statistical translation system for European language pairs
Marta R. Costa-jussà | Josep M. Crego | Adrià de Gispert | Patrik Lambert | Maxim Khalilov | José B. Mariño | José A. R. Fonollosa | Rafael Banchs
Proceedings on the Workshop on Statistical Machine Translation

pdf bib
N-gram-based SMT System Enhanced with Reordering Patterns
Josep M. Crego | Adrià de Gispert | Patrik Lambert | Marta R. Costa-jussà | Maxim Khalilov | Rafael Banchs | José B. Mariño | José A. R. Fonollosa
Proceedings on the Workshop on Statistical Machine Translation

2005

pdf bib
Improving Phrase-Based Statistical Translation by Modifying Phrase Extraction and Including Several Features
Marta Ruiz Costa-jussà | José A. R. Fonollosa
Proceedings of the ACL Workshop on Building and Using Parallel Texts
