Iñaki San Vicente


2020

Building a Task-oriented Dialog System for Languages with no Training Data: the Case for Basque
Maddalen López de Lacalle | Xabier Saralegi | Iñaki San Vicente
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents an approach for developing a task-oriented dialog system for less-resourced languages in scenarios where training data is not available. Both intent classification and slot filling are tackled. We project the existing annotations in rich-resource languages by means of Neural Machine Translation (NMT) and subsequent word alignments. We then compare training on the projected monolingual data with direct model transfer alternatives. Intent classifiers and slot filling sequence taggers are implemented using a BiLSTM architecture or by fine-tuning BERT transformer models. Models learnt exclusively from Basque projected data provide better accuracies for slot filling. For intent classification, combining Basque projected training data with rich-resource language data consistently outperforms models trained solely on projected data. In both tasks we achieve competitive performance, with accuracies of 81% for intent classification and 77% for slot filling.
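The projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes BIO-style slot labels on the source sentence and word alignments given as (source index, target index) pairs; the function name `project_slots`, the example labels and the alignment are hypothetical and are not taken from the paper, whose actual procedure may differ.

```python
def project_slots(source_labels, alignment, target_length):
    """Copy slot types across alignment links, then rebuild BIO prefixes.

    source_labels: BIO labels of the source tokens, e.g. ["O", "B-time"].
    alignment: iterable of (source_index, target_index) pairs (assumed input).
    target_length: number of tokens in the translated sentence.
    """
    slot_types = [None] * target_length
    for src_idx, tgt_idx in alignment:
        label = source_labels[src_idx]
        if label != "O":
            # Drop the "B-"/"I-" prefix and keep only the slot type.
            slot_types[tgt_idx] = label.split("-", 1)[1]

    projected = []
    previous = None
    for slot in slot_types:
        if slot is None:
            projected.append("O")
        elif slot == previous:
            projected.append("I-" + slot)
        else:
            projected.append("B-" + slot)
        previous = slot
    return projected


# Toy example: a five-token source sentence whose last two tokens form a
# "time" slot, translated into four target tokens with reordered word order.
source_labels = ["O", "O", "O", "B-time", "I-time"]
alignment = [(0, 3), (1, 2), (3, 0), (4, 1)]
projected_labels = project_slots(source_labels, alignment, target_length=4)
print(projected_labels)
```

Rebuilding the prefixes on the target side, rather than copying them, keeps the labels well-formed when the translation reorders words. One limitation of this simplification is that two adjacent slots of the same type are merged into one.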

Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the 12th Language Resources and Evaluation Conference

Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately, they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

2015

EliXa: A Modular and Flexible ABSA Platform
Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages
Iñaki San Vicente | Rodrigo Agerri | German Rigau
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria | Nora Aranberri | Pere Comas | Víctor Fresno | Pablo Gamallo | Lluis Padró | Iñaki San Vicente | Jordi Turmo | Arkaitz Zubiaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we introduce TweetNorm_es, an annotated corpus of Spanish-language tweets, which we make publicly available under the terms of the CC-BY license. This corpus is intended for development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. We describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment where the corpus was used.

2012

Building a Basque-Chinese Dictionary by Using English as Pivot
Xabier Saralegi | Iker Manterola | Iñaki San Vicente
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Bilingual dictionaries are key resources in several fields such as translation, language learning or various NLP tasks. However, only major languages have such resources. Dictionaries built automatically by using pivot languages could be a useful resource in these circumstances. Pivot-based bilingual dictionary building is based on merging two bilingual dictionaries which share a common language (e.g. LA-LB, LB-LC) in order to create a dictionary for a new language pair (e.g. LA-LC). This process may include wrong translations due to the polysemy of words. We built Basque-Chinese (Mandarin) dictionaries automatically from Basque-English and Chinese-English dictionaries. In order to prune wrong translations we used different methods suitable for less-resourced languages. Inverse Consultation and Distributional Similarity methods are used because they depend only on easily available resources. Finally, we manually evaluated the quality of the built dictionaries and the adequacy of the methods. Both Inverse Consultation and Distributional Similarity provide good precision of translations but recall is seriously damaged. Distributional Similarity prunes rare translations more accurately than other methods.
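The pivot merge described in the abstract, and the way polysemy in the pivot language introduces wrong translations, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the pruning function is a basic form of inverse consultation that counts shared pivot translations, the `min_shared` threshold is an assumption, and all dictionary entries are invented placeholders (`A_*` for LA words, `C_*` for LC words, English as LB).

```python
from collections import defaultdict


def invert(dictionary):
    """Reverse a bilingual dictionary: each target word maps to its sources."""
    inverted = defaultdict(set)
    for source, targets in dictionary.items():
        for target in targets:
            inverted[target].add(source)
    return dict(inverted)


def pivot_merge(a_to_b, b_to_c):
    """Chain LA-LB and LB-LC entries into LA-LC translation candidates."""
    a_to_c = {}
    for a_word, b_words in a_to_b.items():
        candidates = set()
        for b_word in b_words:
            candidates |= b_to_c.get(b_word, set())
        if candidates:
            a_to_c[a_word] = candidates
    return a_to_c


def inverse_consultation(a_to_b, b_to_c, min_shared=2):
    """Keep a candidate only if enough pivot words link both sides."""
    c_to_b = invert(b_to_c)
    pruned = {}
    for a_word, c_words in pivot_merge(a_to_b, b_to_c).items():
        kept = {
            c_word
            for c_word in c_words
            if len(a_to_b[a_word] & c_to_b[c_word]) >= min_shared
        }
        if kept:
            pruned[a_word] = kept
    return pruned


# Invented toy dictionaries. The pivot word "bank" is polysemous.
a_to_b = {"A_bank": {"bank", "depository"}, "A_dog": {"dog"}}
b_to_c = {
    "bank": {"C_bank", "C_riverbank"},
    "depository": {"C_bank"},
    "dog": {"C_dog"},
}

merged = pivot_merge(a_to_b, b_to_c)
pruned = inverse_consultation(a_to_b, b_to_c)
print(merged)
print(pruned)
```

In this toy data the merge proposes both `C_bank` and `C_riverbank` for `A_bank`, and the pruning step removes `C_riverbank` because only one pivot word supports it. The correct pair `A_dog`/`C_dog` is also dropped, since it is supported by a single pivot word, which mirrors the loss of recall reported in the abstract.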

PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web
Iñaki San Vicente | Iker Manterola
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The importance of parallel corpora in the NLP field is fully acknowledged. This paper presents a tool that can build parallel corpora given just a seed word list and a pair of languages. Our approach is similar to others proposed in the literature, but introduces a new phase to the process. While most systems leave the task of finding websites containing parallel content up to the user, PaCo2 (Parallel Corpora Collector) takes care of that as well. The tool is language independent as far as possible, and adapting the system to work with new languages is fairly straightforward. The different modules have been evaluated for the Basque-Spanish, Spanish-English and Portuguese-English language pairs. Even though there is still room for improvement, results are positive: the corpora created have very good quality translation units, and the quality is maintained across the various language pairs. Details of the corpora created so far are also provided.

2011

Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries
Xabier Saralegi | Iker Manterola | Iñaki San Vicente
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing