Gorka Labaka

Also published as: G. Labaka


2020

pdf bib
A Call for More Rigor in Unsupervised Cross-lingual Learning
Mikel Artetxe | Sebastian Ruder | Dani Yogatama | Gorka Labaka | Eneko Agirre
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world’s languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.

pdf bib
Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
Ivana Kvapilíková | Mikel Artetxe | Gorka Labaka | Eneko Agirre | Ondřej Bojar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.

pdf bib
Translation Artifacts in Cross-lingual Transfer Learning
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such a translation process can introduce subtle artifacts that have a notable impact on existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are highly sensitive to. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in the light of this phenomenon. Based on the gained insights, we also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.

2019

pdf bib
An Effective Approach to Unsupervised Machine Translation
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well-founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points on English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014.

pdf bib
Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Aitor Ormazabal | Mikel Artetxe | Gorka Labaka | Aitor Soroa | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. To answer this question, we experiment with parallel corpora, which allow us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal.

pdf bib
Bilingual Lexicon Induction through Unsupervised Machine Translation
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques. As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resources besides the monolingual corpus used to train the embeddings. When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art on the standard MUSE dataset.
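The final step of the pipeline described above, extracting a lexicon from the synthetic parallel corpus, can be illustrated with a minimal sketch. Note that this toy uses raw co-occurrence counts rather than the statistical word alignment techniques (e.g. IBM-model-based aligners) the paper actually relies on, and the function name and example corpus are invented for illustration:

```python
from collections import Counter, defaultdict

def extract_lexicon(parallel_pairs):
    # Toy lexicon extraction: count source/target word co-occurrences over a
    # (synthetic) sentence-aligned corpus and keep the most frequent target
    # word per source word. A real system would use proper word alignment.
    cooc = defaultdict(Counter)
    for src_sent, tgt_sent in parallel_pairs:
        for s in src_sent.split():
            for t in tgt_sent.split():
                cooc[s][t] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in cooc.items()}
```

On a tiny English-French corpus such as `[("the dog", "le chien"), ("the cat", "le chat"), ("a dog", "un chien"), ("a cat", "un chat")]`, the shared-context words dominate the counts and the expected pairs (dog/chien, cat/chat, the/le, a/un) fall out.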

pdf bib
Leveraging SNOMED CT terms and relations for machine translation of clinical texts from Basque to Spanish
Xabier Soto | Olatz Perez-De-Viñaspre | Maite Oronoz | Gorka Labaka
Proceedings of the Second Workshop on Multilingualism at the Intersection of Knowledge Bases and Machine Translation

2018

pdf bib
Konbitzul: an MWE-specific database for Spanish-Basque
Uxoa Iñurrieta | Itziar Aduriz | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Building Named Entity Recognition Taggers via Parallel Corpora
Rodrigo Agerri | Yiling Chung | Itziar Aldabe | Nora Aranberri | Gorka Labaka | German Rigau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation
Mikel Artetxe | Gorka Labaka | Iñigo Lopez-Gazpio | Eneko Agirre
Proceedings of the 22nd Conference on Computational Natural Language Learning

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than is directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations on downstream tasks is larger for unsupervised systems than for supervised ones.

pdf bib
Unsupervised Statistical Machine Translation
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses.
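The phrase-table induction step described above can be sketched at the word level as a softmax over similarities of mapped embeddings. This is only a toy analogue: the function name, temperature, and `top_k` are illustrative assumptions rather than the paper's hyperparameters, and the real system scores multi-word phrases and combines them with an n-gram language model:

```python
import numpy as np

def induce_translation_probs(X, Z, temperature=0.1, top_k=2):
    """Toy induction of word translation probabilities from mapped embeddings.

    X, Z: length-normalized source/target embedding matrices already mapped
    into a shared cross-lingual space. Returns, for each source word index,
    the top_k target indices with softmax-normalized similarity scores.
    """
    sims = X @ Z.T                              # cosine similarities
    scores = np.exp(sims / temperature)         # temperature-scaled softmax
    probs = scores / scores.sum(axis=1, keepdims=True)
    order = np.argsort(-probs, axis=1)[:, :top_k]
    return [[(int(j), float(probs[i, j])) for j in order[i]]
            for i in range(len(X))]
```

With identical embedding spaces up to a permutation, the induced top candidate for each source word is its true counterpart with probability close to 1.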

pdf bib
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap.
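The self-learning loop described above can be sketched in a few lines of numpy. This is not the released vecmap implementation (the toy data, function names, and iteration count are assumptions); it simply alternates between solving an orthogonal Procrustes mapping on the current dictionary and re-inducing the dictionary by nearest neighbors in the mapped space:

```python
import numpy as np

def orthogonal_map(X_d, Z_d):
    # Orthogonal Procrustes: the W minimizing ||X_d W - Z_d||_F over
    # orthogonal matrices is U @ Vt from the SVD of X_d.T @ Z_d.
    U, _, Vt = np.linalg.svd(X_d.T @ Z_d)
    return U @ Vt

def self_learning(X, Z, seed_pairs, iters=3):
    """Alternate between fitting a mapping on the current dictionary and
    re-inducing the dictionary via nearest neighbors in the mapped space.

    X, Z: row-normalized source/target embedding matrices.
    seed_pairs: initial (src_index, tgt_index) dictionary.
    """
    pairs = list(seed_pairs)
    W = None
    for _ in range(iters):
        src, tgt = map(list, zip(*pairs))
        W = orthogonal_map(X[src], Z[tgt])
        sims = (X @ W) @ Z.T                # similarities after mapping
        pairs = [(i, int(np.argmax(sims[i]))) for i in range(len(X))]
    return W, pairs
```

On synthetic data where the target space is an exact rotation and permutation of the source space, a handful of seed pairs spanning the space suffices for the loop to recover the rotation and the full dictionary.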

2017

pdf bib
Learning bilingual word embeddings with (almost) no bilingual data
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most methods to learn bilingual word embeddings rely on large parallel corpora, which are difficult to obtain for most language pairs. This has motivated an active research line to relax this requirement, with methods that use document-aligned corpora or bilingual dictionaries of a few thousand words instead. In this work, we further reduce the need for bilingual resources using a very simple self-learning approach that can be combined with any dictionary-based mapping technique. Our method exploits the structural similarity of embedding spaces, and works with as little bilingual evidence as a 25-word dictionary or even an automatically generated list of numerals, obtaining results comparable to those of systems that use richer resources.

pdf bib
Rule-Based Translation of Spanish Verb-Noun Combinations into Basque
Uxoa Iñurrieta | Itziar Aduriz | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

This paper presents a method to improve the translation of Verb-Noun Combinations (VNCs) in a rule-based Machine Translation (MT) system for Spanish-Basque. Linguistic information about a set of VNCs is gathered from the public database Konbitzul and integrated into the MT system, leading to improvements in BLEU, NIST and TER scores; human evaluators also judged the results clearly better.

2016

pdf bib
Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation
Gorka Labaka | Iñaki Alegria | Kepa Sarasola
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents how a state-of-the-art SMT system can be enriched with extra in-domain parallel corpora extracted from Wikipedia. We collect corpora from parallel titles and from parallel fragments in comparable articles from Wikipedia. We carried out an evaluation with a double objective: evaluating the quality of the extracted data and evaluating the improvement due to domain adaptation. We think this can be very useful for languages with a limited amount of parallel corpora, where in-domain data is crucial to improve the performance of MT systems. The experiments on the Spanish-English language pair improve a baseline trained on the Europarl corpus by more than 2 BLEU points when translating in the Computer Science domain.

pdf bib
Learning principled bilingual mappings of word embeddings while preserving monolingual invariance
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Using Linguistic Data for English and Spanish Verb-Noun Combination Identification
Uxoa Iñurrieta | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola | Itziar Aduriz | John Carroll
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present a linguistic analysis of a set of English and Spanish verb+noun combinations (VNCs), and a method to use this information to improve VNC identification. Firstly, a sample of frequent VNCs is analysed in-depth and tagged along lexico-semantic and morphosyntactic dimensions, obtaining satisfactory inter-annotator agreement scores. Then, a VNC identification experiment is undertaken, where the analysed linguistic data is combined with chunking information and syntactic dependencies. A comparison between the results of the experiment and the results obtained by a basic detection method shows that VNC identification can be greatly improved by using linguistic information, as a large number of additional occurrences are detected with high precision.

pdf bib
SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
Rosa Gaudio | Gorka Labaka | Eneko Agirre | Petya Osenova | Kiril Simov | Martin Popel | Dieke Oele | Gertjan van Noord | Luís Gomes | João António Rodrigues | Steven Neale | João Silva | Andreia Querido | Nuno Rendeiro | António Branco
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
IXA Biomedical Translation System at WMT16 Biomedical Translation Task
Olatz Perez-de-Viñaspre | Gorka Labaka
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Adding syntactic structure to bilingual terminology for improved domain adaptation
Mikel Artetxe | Gorka Labaka | Chakaveh Saedi | João Rodrigues | João Silva | António Branco | Eneko Agirre
Proceedings of the 2nd Deep Machine Translation Workshop

2015

pdf bib
Exploiting portability to build an RBMT prototype for a new source language
Nora Aranberri | Gorka Labaka | Arantza Díaz de Ilarraza | Kepa Sarasola
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Building hybrid machine translation systems by using an EBMT preprocessor to create partial translations
Mikel Artetxe | Gorka Labaka | Kepa Sarasola
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Analyzing English-Spanish Named-Entity enhanced Machine Translation
Mikel Artetxe | Eneko Agirre | Inaki Alegria | Gorka Labaka
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Deep-syntax TectoMT for English-Spanish MT
Gorka Labaka | Oneka Jauregi | Arantza Díaz de Ilarraza | Michael Ustaszewski | Nora Aranberri | Eneko Agirre
Proceedings of the 1st Deep Machine Translation Workshop

2012

pdf bib
Developing an Open-Source FST Grammar for Verb Chain Transfer in a Spanish-Basque MT System
Aingeru Mayor | Mans Hulden | Gorka Labaka
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

2010

pdf bib
Plagiarism Detection across Distant Language Pairs
Alberto Barrón-Cedeño | Paolo Rosso | Eneko Agirre | Gorka Labaka
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
A Morphological Processor Based on Foma for Biscayan (a Basque dialect)
Iñaki Alegria | Garbiñe Aranbarri | Klara Ceberio | Gorka Labaka | Bittor Laskurain | Ruben Urizar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a new morphological processor for Biscayan, a dialect of Basque, developed from the description of the morphology of standard Basque. The database for the standard morphology has been extended to cover dialects, and foma, an open-source tool for morphological description, is used to build the processor. Biscayan is a dialect of the Basque language spoken mainly in Biscay, a province in the west of the Basque Country. The lexicon and the morphotactics (or word grammar) of standard Basque were described using a relational database, and this database has been extended to include dialectal variants linked to the standard entries. XuxenB, a spelling checker/corrector for this dialect, is the first application of this work. In addition to the basic analyzer used for spelling, a new transducer is included: an enhanced analyzer that links dialectal forms with the corresponding standard ones. It is used in correction to generate proposals when standard forms appear in the input text and we want to replace them with dialectal forms.

2009

pdf bib
Use of Rich Linguistic Information to Translate Prepositions and Grammar Cases to Basque
Eneko Agirre | Aitziber Atutxa | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
Relevance of Different Segmentation Options on Spanish-Basque SMT
Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2008

pdf bib
Strategies for sustainable MT for Basque: incremental design, reusability, standardization and open-source
I. Alegria | X. Arregi | X. Artola | A. Diaz de Ilarraza | G. Labaka | M. Lersundi | A. Mayor | K. Sarasola
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages