Ander Barrena


2020

Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the 12th Language Resources and Evaluation Conference

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal because, for many languages, the models have been trained on smaller (or lower-quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state of the art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

2018

Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 22nd Conference on Computational Natural Language Learning

Named Entity Disambiguation algorithms typically learn a single model for all target entities. In this paper we present a word expert model and train separate deep learning models for each target entity string, yielding 500K classification tasks. This gives us the opportunity to benchmark popular text representation alternatives on this massive dataset. To cope with scarce training data, we propose a simple data-augmentation technique and transfer learning. We show that bag-of-word-embeddings are better than LSTMs for tasks with scarce training data, while the situation is reversed when larger amounts are available. Transferring an LSTM which is learned on all datasets is the most effective context representation option for the word experts in all frequency bands. The experiments show that our system trained on out-of-domain Wikipedia data surpasses comparable NED systems which have been trained on in-domain training data.

2016

Alleviating Poor Context with Background Knowledge for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2014

“One Entity per Discourse” and “One Entity per Collocation” Improve Named-Entity Disambiguation
Ander Barrena | Eneko Agirre | Bernardo Cabaleiro | Anselmo Peñas | Aitor Soroa
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2012

Matching Cultural Heritage items to Wikipedia
Eneko Agirre | Ander Barrena | Oier Lopez de Lacalle | Aitor Soroa | Samuel Fernando | Mark Stevenson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Digitised Cultural Heritage (CH) items usually have short descriptions and lack rich contextual information. Wikipedia articles, on the contrary, include in-depth descriptions and links to related articles, which motivates the enrichment of CH items with information from Wikipedia. In this paper we explore the feasibility of finding matching articles in Wikipedia for a given Cultural Heritage item. We manually annotated a random sample of items from Europeana, and performed a qualitative and quantitative study of the issues and problems that arise, showing that each kind of CH item is different and needs a nuanced definition of what “matching article” means. In addition, we test a well-known wikification (aka entity linking) algorithm on the task. Our results indicate that a substantial number of items can be effectively linked to their corresponding Wikipedia article.