Iñaki Alegría

Also published as: Iñaki Alegria, I Alegria, Inaki Alegria, I. Alegria


2018

Measuring language distance among historical varieties using perplexity. Application to European Portuguese.
Jose Ramom Pichel Campos | Pablo Gamallo | Iñaki Alegria
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

The objective of this work is to quantify, with a simple and robust measure, the distance between historical varieties of a language. The measure will be inferred from text corpora corresponding to historical periods. Different approaches have been proposed for similar aims: Language Identification, Phylogenetics, Historical Linguistics or Dialectology. In our approach, we used a perplexity-based measure to calculate language distance between all the historical periods of a specific language: European Portuguese. Perplexity has also proven to be a robust metric to calculate distance between languages. However, this measure has not been tested yet to identify diachronic periods within the historical evolution of a specific language. For this purpose, a historical Portuguese corpus has been constructed from different open sources containing texts with close original spelling. The results of our experiments show that Portuguese keeps an important degree of homogeneity over time. We anticipate this metric to be a starting point to be applied to other languages.

Verbal Multiword Expressions in Basque Corpora
Uxoa Iñurrieta | Itziar Aduriz | Ainara Estarrona | Itziar Gonzalez-Dios | Antton Gurrutxaga | Ruben Urizar | Iñaki Alegria
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper presents a Basque corpus in which Verbal Multiword Expressions (VMWEs) were annotated following universal guidelines. Information on the annotation is given, and some points for discussion on the guidelines are also proposed. The corpus is useful not only for NLP-related research, but also for drawing conclusions on Basque phraseology in comparison with other languages.

2017

A Comparison of Feature-Based and Neural Scansion of Poetry
Manex Agirrezabal | Iñaki Alegria | Mans Hulden
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Automatic analysis of poetic rhythm is a challenging task that involves linguistics, literature, and computer science. When the language to be analyzed is known, rule-based systems or data-driven methods can be used. In this paper, we analyze poetic rhythm in English and Spanish. We show that the representations of data learned from character-based neural models are more informative than the ones from hand-crafted features, and that a Bi-LSTM+CRF model produces state-of-the-art accuracy on scansion of poetry in two languages. Results also show that information about whole-word structure, and not just independent syllables, is highly informative for performing scansion.

A Perplexity-Based Method for Similar Languages Discrimination
Pablo Gamallo | Jose Ramom Pichel | Iñaki Alegria
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This article describes the system submitted by the Citius_Ixa_Imaxin team to VarDial 2017 (DSL and GDI tasks). The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we tested is a voting system that combines several n-gram models of both words and characters, although word unigrams turned out to be a very competitive model, with reasonable results in the tasks in which we participated. An error analysis was performed in which we identified many test examples with no linguistic evidence to distinguish among the variants.

2016

Evaluating Translation Quality and CLIR Performance of Query Sessions
Xabier Saralegi | Eneko Agirre | Iñaki Alegria
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper evaluates translation quality and Cross-Lingual Information Retrieval (CLIR) performance when session information is used as the context of queries. The hypothesis is that previous queries provide context that helps to resolve ambiguous translations in the current query. We tested several strategies on the TREC 2010 Session track dataset, which includes query reformulations grouped by generalization, specification, and drifting types. We study the Basque to English direction, evaluating both the translation quality and CLIR performance, with positive results in both cases. The results show that using session information improved translation quality, reducing the error rate by 12% (HTER), which improved CLIR results by 5% (nDCG). We also provide an analysis of the improvements across the three kinds of sessions. Translation quality improved in all three types, and CLIR improved for generalization and specification sessions while performance was preserved in drifting sessions.

Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a method for the normalization of historical texts using a combination of weighted finite-state transducers and language models. We have extended our previous work on the normalization of dialectal texts and tested the method against a 17th century literary work in Basque. This preprocessed corpus is made available in the LREC repository. The performance of this method for learning relations between historical and contemporary word forms is evaluated against resources in three languages. The method we present learns to map phonological changes using a noisy channel model. The model is based on techniques commonly used for phonological inference and for producing Grapheme-to-Grapheme conversion systems encoded as weighted transducers, and it produces F-scores above 80% in the task for Basque. A wider evaluation shows that the approach performs equally well with all the languages in our evaluation suite: Basque, Spanish and Slovene. A comparison against other methods that address the same task is also provided.

Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation
Gorka Labaka | Iñaki Alegria | Kepa Sarasola
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents how a state-of-the-art SMT system is enriched by using extra in-domain parallel corpora extracted from Wikipedia. We collect corpora from parallel titles and from parallel fragments in comparable articles from Wikipedia. We carried out an evaluation with a double objective: evaluating the quality of the extracted data and evaluating the improvement due to the domain adaptation. We think this can be very useful for languages with a limited amount of parallel corpora, where in-domain data is crucial to improve the performance of MT systems. The experiments on the Spanish-English language pair improve on a baseline trained with the Europarl corpus by more than 2 BLEU points when translating in the Computer Science domain.

TweetMT: A Parallel Microblog Corpus
Iñaki San Vicente | Iñaki Alegría | Cristina España-Bonet | Pablo Gamallo | Hugo Gonçalo Oliveira | Eva Martínez Garcia | Antonio Toral | Arkaitz Zubiaga | Nora Aranberri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

Machine Learning for Metrical Analysis of English Poetry
Manex Agirrezabal | Iñaki Alegria | Mans Hulden
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this work we tackle the challenge of identifying rhythmic patterns in poetry written in English. Although poetry is a literary form that makes use of standard meters usually repeated among different authors, we will see in this paper how performing such analyses is a difficult task in machine learning due to the unexpected deviations from such standard patterns. After breaking down some examples of classical poetry, we apply a number of NLP techniques for the scansion of poetry, training and testing our systems against a human-annotated corpus. With these experiments, our purpose is to establish a baseline of automatic scansion of poetry using NLP tools in a straightforward manner and to raise awareness of the difficulties of this task.

EHU at the SIGMORPHON 2016 Shared Task. A Simple Proposal: Grapheme-to-Phoneme for Inflection
Iñaki Alegria | Izaskun Etxeberria
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Combining Phonology and Morphology for the Normalization of Historical Texts
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Comparing Two Basic Methods for Discriminating Between Similar Languages and Varieties
Pablo Gamallo | Iñaki Alegria | José Ramom Pichel | Manex Agirrezabal
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This article describes the systems submitted by the Citius_Ixa_Imaxin team to the Discriminating Similar Languages Shared Task 2016. The systems are based on two different strategies: classification with ranked dictionaries and Naive Bayes classifiers. The results of the evaluation show that ranked dictionaries are more sound and stable across different domains, while basic Bayesian models perform reasonably well on in-domain datasets but their performance drops when they are applied to out-of-domain texts.

2015

Analyzing English-Spanish Named-Entity enhanced Machine Translation
Mikel Artetxe | Eneko Agirre | Inaki Alegria | Gorka Labaka
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

2014

TweetNorm_es: an annotated corpus for Spanish microtext normalization
Iñaki Alegria | Nora Aranberri | Pere Comas | Víctor Fresno | Pablo Gamallo | Lluis Padró | Iñaki San Vicente | Jordi Turmo | Arkaitz Zubiaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish, which we make publicly available under the terms of the CC-BY license. This corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment where the corpus was used.

2013

Combining Different Features of Idiomaticity for the Automatic Classification of Noun+Verb Expressions in Basque
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the 9th Workshop on Multiword Expressions

2012

Measuring the compositionality of NV expressions in Basque by means of distributional similarity techniques
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present several experiments aimed at measuring the semantic compositionality of NV expressions in Basque. Our approach is based on the hypothesis that compositionality can be related to distributional similarity. The contexts of each NV expression are compared with the contexts of its corresponding components by means of different techniques, such as similarity measures usually used with the Vector Space Model (VSM), Latent Semantic Analysis (LSA) and some measures implemented in the Lemur Toolkit, such as Indri index, tf-idf, Okapi index and Kullback-Leibler divergence. Using our previous work with cooccurrence techniques as a baseline, the results point to improvements using the Indri index or Kullback-Leibler divergence, and a slight further improvement when used in combination with cooccurrence measures such as t-score, via rank aggregation. This work is part of a project for MWE extraction and characterization using different techniques aimed at measuring the properties related to idiomaticity, such as institutionalization, non-compositionality and lexico-syntactic fixedness.

BAD: An Assistant tool for making verses in Basque
Manex Agirrezabal | Iñaki Alegria | Bertol Arrieta | Mans Hulden
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing
Iñaki Alegria | Mans Hulden
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

Finite-State Technology in a Verse-Making Tool
Manex Agirrezabal | Iñaki Alegria | Bertol Arrieta | Mans Hulden
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

2011

Recognition and Classification of Numerical Entities in Basque
Ander Soraluze | Iñaki Alegria | Olatz Ansa | Olatz Arregi | Xabier Arregi
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Automatic Extraction of NV Expressions in Basque: Basic Issues on Cooccurrence Techniques
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

Learning word-level dialectal variation as phonological replacement rules using a limited parallel corpus
Mans Hulden | Iñaki Alegria | Izaskun Etxeberria | Montse Maritxalar
Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

2010

A Morphological Processor Based on Foma for Biscayan (a Basque dialect)
Iñaki Alegria | Garbiñe Aranbarri | Klara Ceberio | Gorka Labaka | Bittor Laskurain | Ruben Urizar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a new morphological processor for Biscayan, a dialect of Basque, developed on the description of the morphology of standard Basque. The database for the standard morphology has been extended for dialects, and an open-source tool for morphological description named foma is used for building the processor. Biscayan is a dialect of the Basque language spoken mainly in Biscay, a province in the west of the Basque Country. The description of the lexicon and the morphotactics (or word grammar) for standard Basque was carried out using a relational database, and the database has been extended in order to include dialectal variants linked to the standard entries. XuxenB, a spelling checker/corrector for this dialect, is the first application of this work. In addition to the basic analyzer used for spelling, a new transducer is included. It is an enhanced analyzer for linking standard forms with the corresponding dialectal ones. It is used in correction to generate proposals when standard forms that we want to replace with dialectal forms appear in the input text.

2008

Spelling Correction: from Two-Level Morphology to Open Source
Iñaki Alegria | Klara Ceberio | Nerea Ezeiza | Aitor Soroa | Gregorio Hernandez
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of language, and there are two-level based descriptions for very different languages. Once the morphological description for a language is done, it is easy to develop a spelling checker/corrector for that language. However, what happens if we want to use the speller in the “free world” (OpenOffice, Mozilla, emacs, LaTeX, etc.)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of mechanisms based on two-level morphology, this paper describes an automatic conversion from a two-level description to hunspell.

Strategies for sustainable MT for Basque: incremental design, reusability, standardization and open-source
I. Alegria | X. Arregi | X. Artola | A. Diaz de Ilarraza | G. Labaka | M. Lersundi | A. Mayor | K. Sarasola
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

2006

Using Machine Learning Techniques to Build a Comma Checker for Basque
Iñaki Alegria | Bertol Arrieta | Arantza Diaz de Ilarraza | Eli Izagirre | Montse Maritxalar
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

Structure, Annotation and Tools in the Basque ZT Corpus
N. Areta | A. Gurrutxaga | I. Leturia | Z. Polin | R. Saiz | I. Alegria | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | A. Sologaistoa | A. Soroa | A. Valverde
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, intended to be a main resource for research and development on written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque to be distributed by ELDA (at the end of 2006), and it aims to be a methodological and functional reference for new projects in the future (e.g. a national corpus for Basque). We also present the technology and the tools used to build this corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.

Named Entities Translation Based on Comparable Corpora
Iñaki Alegria | Nerea Ezeiza | Izaskun Fernandez
Proceedings of the Workshop on Multi-word-expressions in a multilingual context

A Multiclassifier based Document Categorization System: profiting from the Singular Value Decomposition Dimensionality Reduction Technique
Ana Zelaia | Iñaki Alegria | Olatz Arregi | Basilio Sierra
Proceedings of the Workshop on Learning Structured Information in Natural Language Applications

2005

An open-source shallow-transfer machine translation engine for the Romance languages of Spain
Antonio M. Corbi-Bellot | Mikel L. Forcada | Sergio Ortíz-Rojas | Juan Antonio Pérez-Ortiz | Gema Ramírez-Sánchez | Felipe Sánchez-Martínez | Iñaki Alegria | Aingeru Mayor | Kepa Sarasola
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

2004

A XML-Based Term Extraction Tool for Basque
I. Alegria | A. Gurrutxaga | P. Lizaso | X. Saralegi | S. Ugartetxea | R. Urizar
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This project combines linguistic and statistical information to develop a term extraction tool for Basque. Because Basque is an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, due to the late unification process of the language, texts present higher term dispersion than in a highly normalized language. The result is a semi-automatic terminology extraction tool based on XML, for use in technical and scientific information management.

Representation and Treatment of Multiword Expressions in Basque
Iñaki Alegria | Olatz Ansa | Xabier Artola | Nerea Ezeiza | Koldo Gojenola | Ruben Urizar
Proceedings of the Workshop on Multiword Expressions: Integrating Processing

2000

A word-grammar based morphological analyzer for agglutinative languages
I. Aduriz | E. Agirre | I. Aldezabal | I. Alegria | X. Arregi | J. M. Arriola | X. Artola | K. Gojenola | A. Maritxalar | K. Sarasola | M. Urkia
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

1999

Designing spelling correctors for inflected languages using lexical transducers
I. Aldezabal | I. Alegria | O. Ansa | J. M. Arriola | N. Ezeiza
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages
N. Ezeiza | I. Alegria | J.M. Arriola | R. Urizar | I. Aduriz
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages
N. Ezeiza | I. Alegria | J.M. Arriola | R. Urizar | I. Aduriz
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

1993

A Morphological Analysis Based Method for Spelling Correction
I. Aduriz | E. Agirre | I. Alegria | X. Arregi | J.M Arriola | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | M. Maritxalar | K. Sarasola | M. Urkia
Sixth Conference of the European Chapter of the Association for Computational Linguistics

1992

XUXEN: A Spelling Checker/Corrector for Basque Based on Two-Level Morphology
E. Agirre | I Alegria | X Arregi | X Artola | A Diaz de Ilarraza | M Maritxalar | K Sarasola | M Urkia
Third Conference on Applied Natural Language Processing