Luis Espinosa Anke

Also published as: Luis Espinosa-Anke


2020

Cardiff University at SemEval-2020 Task 6: Fine-tuning BERT for Domain-Specific Definition Classification
Shelan Jeawak | Luis Espinosa-Anke | Steven Schockaert
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We describe the system submitted to SemEval-2020 Task 6, Subtask 1. The aim of this subtask is to predict whether a given sentence contains a definition or not. Unsurprisingly, we found that strong results can be achieved by fine-tuning a pre-trained BERT language model. In this paper, we analyze the performance of this strategy. Among other findings, we show that results can be improved by using a two-step fine-tuning process, in which the BERT model is first fine-tuned on the full training set, and then further specialized towards a target domain.

On the Robustness of Unsupervised and Semi-supervised Cross-lingual Word Embedding Learning
Yerai Doval | Jose Camacho-Collados | Luis Espinosa Anke | Steven Schockaert
Proceedings of the 12th Language Resources and Evaluation Conference

Cross-lingual word embeddings are vector representations of words in different languages where words with similar meaning are represented by similar vectors, regardless of the language. Recent developments which construct these embeddings by aligning monolingual spaces have shown that accurate alignments can be obtained with little or no supervision, which usually comes in the form of bilingual dictionaries. However, the focus has been on a particular controlled scenario for evaluation, and there is no strong evidence on how current state-of-the-art systems would fare with noisy text or for language pairs with major linguistic differences. In this paper we present an extensive evaluation over multiple cross-lingual embedding models, analyzing their strengths and limitations with respect to different variables such as target language, training corpora and amount of supervision. Our conclusions put in doubt the view that high-quality cross-lingual embeddings can always be learned without much supervision.

TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification
Francesco Barbieri | Jose Camacho-Collados | Luis Espinosa Anke | Leonardo Neves
Findings of the Association for Computational Linguistics: EMNLP 2020

The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is neither a standardized evaluation protocol nor a strong set of baselines trained on such domain-specific data. In this paper, we propose a new evaluation framework (TweetEval) consisting of seven heterogeneous Twitter-specific classification tasks. We also provide a strong set of baselines as a starting point, and compare different language modeling pre-training strategies. Our initial experiments show the effectiveness of starting off with existing pre-trained generic language models and continuing to train them on Twitter corpora.

Combining BERT with Static Word Embeddings for Categorizing Social Media
Israa Alghanmi | Luis Espinosa Anke | Steven Schockaert
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Pre-trained neural language models (LMs) have achieved impressive results in various natural language processing tasks, across different languages. Surprisingly, this extends to the social media genre, despite the fact that social media often has very different characteristics from the language that LMs have seen during training. A particularly striking example is the performance of AraBERT, an LM for the Arabic language, which is successful in categorizing social media posts in Arabic dialects, despite only having been trained on Modern Standard Arabic. Our hypothesis in this paper is that the performance of LMs for social media can nonetheless be improved by incorporating static word vectors that have been specifically trained on social media. We show that a simple method for incorporating such word vectors is indeed successful in several Arabic and English benchmarks. Curiously, however, we also find that similar improvements are possible with word vectors that have been trained on traditional text sources (e.g. Wikipedia).

CollFrEn: Rich Bilingual English–French Collocation Resource
Beatriz Fisas | Joan Codina-Filbá | Luis Espinosa Anke | Leo Wanner
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

Collocations in the sense of idiosyncratic lexical co-occurrences of two syntactically bound words traditionally pose a challenge to language learners and many Natural Language Processing (NLP) applications alike. Reliable ground truth (i.e., ideally manually compiled) resources are thus of high value. We present a manually compiled bilingual English–French collocation resource with 7,480 collocations in English and 6,733 in French. Each collocation is enriched with information that facilitates its downstream exploitation in NLP tasks such as machine translation, word sense disambiguation, natural language generation, relation classification, and so forth. Our proposed enrichment covers: the semantic category of the collocation (its lexical function), its vector space representation (for each individual word as well as their joint collocation embedding), a subcategorization pattern of both its elements, as well as their corresponding BabelNet id, and finally, indices of their occurrences in large scale reference corpora.

Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA)
Sergio Oramas | Luis Espinosa-Anke | Elena Epure | Rosie Jones | Mohamed Sordo | Massimo Quadrana | Kento Watanabe
Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA)

Don’t Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities
Carla Perez Almendros | Luis Espinosa Anke | Steven Schockaert
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we introduce a new annotated dataset which is aimed at supporting the development of NLP models to identify and categorize language that is patronizing or condescending towards vulnerable communities (e.g. refugees, homeless people, poor families). While the prevalence of such language in the general media has long been shown to have harmful effects, it differs from other types of harmful language, in that it is generally used unconsciously and with good intentions. We furthermore believe that the often subtle nature of patronizing and condescending language (PCL) presents an interesting technical challenge for the NLP community. Our analysis of the proposed dataset shows that identifying PCL is hard for standard NLP models, with language models such as BERT achieving the best results.

Definition Extraction Feature Analysis: From Canonical to Naturally-Occurring Definitions
Mireia Roig Mirapeix | Luis Espinosa Anke | Jose Camacho-Collados
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

Textual definitions constitute a fundamental source of knowledge when seeking the meaning of words, and they are the cornerstone of lexical resources like glossaries, dictionaries, encyclopedias or thesauri. In this paper, we present an in-depth analytical study on the main features relevant to the task of definition extraction. Our main goal is to study whether linguistic structures from canonical definitions (the Aristotelian or genus et differentia model) can be leveraged to retrieve definitions from corpora in different domains of knowledge and textual genres alike. To this end, we develop a simple linear classifier and analyze the contribution of several (sets of) linguistic features. Finally, as a result of our experiments, we also shed light on the particularities of existing benchmarks as well as the most challenging aspects of the task.

Towards Preemptive Detection of Depression and Anxiety in Twitter
David Owen | Jose Camacho-Collados | Luis Espinosa Anke
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

Depression and anxiety are psychiatric disorders that are observed in many areas of everyday life. For example, these disorders manifest themselves somewhat frequently in texts written by nondiagnosed users in social media. However, detecting users with these conditions is not a straightforward task as they may not explicitly talk about their mental state, and if they do, contextual cues such as immediacy must be taken into account. When available, linguistic flags pointing to probable anxiety or depression could be used by medical experts to write better guidelines and treatments. In this paper, we develop a dataset designed to foster research in depression and anxiety detection in Twitter, framing the detection task as a binary tweet classification problem. We then apply state-of-the-art classification models to this dataset, providing a competitive set of baselines alongside qualitative error analysis. Our results show that language models perform reasonably well, and better than more traditional baselines. Nonetheless, there is clear room for improvement, particularly with unbalanced training sets and in cases where seemingly obvious linguistic cues (keywords) are used counter-intuitively.

2019

Relational Word Embeddings
Jose Camacho-Collados | Luis Espinosa Anke | Steven Schockaert
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While word embeddings have been shown to implicitly encode various forms of attributional knowledge, the extent to which they capture relational information is far more limited. In previous work, this limitation has been addressed by incorporating relational knowledge from external knowledge bases when learning the word embedding. Such strategies may not be optimal, however, as they are limited by the coverage of available resources and conflate similarity with other forms of relatedness. As an alternative, in this paper we propose to encode relational knowledge in a separate word embedding, which is aimed to be complementary to a given standard word embedding. This relational word embedding is still learned from co-occurrence statistics, and can thus be used even when no external knowledge base is available. Our analysis shows that relational word vectors do indeed capture information that is complementary to what is encoded in standard word embeddings.

Collocation Classification with Unsupervised Relation Vectors
Luis Espinosa Anke | Steven Schockaert | Leo Wanner
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Lexical relation classification is the task of predicting whether a certain relation holds between a given pair of words. In this paper, we explore to what extent the current distributional landscape based on word embeddings provides a suitable basis for classification of collocations, i.e., pairs of words between which idiosyncratic lexical relations hold. First, we introduce a novel dataset with collocations categorized according to lexical functions. Second, we conduct experiments on a subset of this benchmark, comparing it in particular to the well-known DiffVec dataset. In these experiments, in addition to simple word vector arithmetic operations, we also investigate the role of unsupervised relation vectors as a complementary input. While these relation vectors indeed help, we also show that lexical function classification poses a greater challenge than the syntactic and semantic relations that are typically used for benchmarks in the literature.

Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)
Luis Espinosa-Anke | Thierry Declerck | Dagmar Gromann | Jose Camacho-Collados | Mohammad Taher Pilehvar
Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5)

Cardiff University at SemEval-2019 Task 4: Linguistic Features for Hyperpartisan News Detection
Carla Pérez-Almendros | Luis Espinosa-Anke | Steven Schockaert
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper summarizes our contribution to the Hyperpartisan News Detection task in SemEval 2019. We experiment with two different approaches: 1) an SVM classifier based on word vector averages and hand-crafted linguistic features, and 2) a BiLSTM-based neural text classifier trained on a filtered training set. Surprisingly, despite their different nature, both approaches achieve an accuracy of 0.74. The main focus of this paper is to further analyze the remarkable fact that a simple feature-based approach can perform on par with modern neural classifiers. We also highlight the effectiveness of our filtering strategy for training the neural network on a large but noisy training set.

2018

SemEval 2018 Task 2: Multilingual Emoji Prediction
Francesco Barbieri | Jose Camacho-Collados | Francesco Ronzano | Luis Espinosa-Anke | Miguel Ballesteros | Valerio Basile | Viviana Patti | Horacio Saggion
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes the results of the first Shared Task on Multilingual Emoji Prediction, organized as part of SemEval 2018. Given the text of a tweet, the task consists of predicting the most likely emoji to be used alongside that tweet. Two subtasks were proposed, one for English and one for Spanish, and participants were allowed to submit a system run to one or both subtasks. In total, 49 teams participated in the English subtask and 22 teams submitted a system run to the Spanish subtask. Evaluation was carried out emoji-wise, and the final ranking was based on macro F-Score. Data and further information about this task can be found at https://competitions.codalab.org/competitions/17344.
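
As an illustration of the ranking metric mentioned in the abstract above, the following is a minimal sketch of a macro-averaged F-score: an F1 value is computed per label (here, per emoji) and the per-label values are averaged with equal weight. The function name macro_f1 and the toy labels are hypothetical, and whether the official task scorer treats labels that appear only in the predictions in the same way is an assumption.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred (illustrative)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # Per-label counts of true positives, false positives, false negatives.
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    # Every label contributes equally, regardless of how frequent it is.
    return sum(scores) / len(scores)


# Toy example with two hypothetical emoji labels.
gold = ["heart", "heart", "joy", "joy"]
pred = ["heart", "joy", "joy", "joy"]
score = macro_f1(gold, pred)
```

Because each label is weighted equally, a system that only predicts the most frequent emoji scores poorly under this metric even if its accuracy is reasonable.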

SemEval-2018 Task 9: Hypernym Discovery
Jose Camacho-Collados | Claudio Delli Bovi | Luis Espinosa-Anke | Sergio Oramas | Tommaso Pasini | Enrico Santus | Vered Shwartz | Roberto Navigli | Horacio Saggion
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes the SemEval 2018 Shared Task on Hypernym Discovery. We put forward this task as a complementary benchmark for modeling hypernymy, a problem which has traditionally been cast as a binary classification task, taking a pair of candidate words as input. Instead, our reformulated task is defined as follows: given an input term, retrieve (or discover) its suitable hypernyms from a target corpus. We proposed five different subtasks covering three languages (English, Spanish, and Italian), and two specific domains of knowledge in English (Medical and Music). Participants were allowed to compete in any or all of the subtasks. Overall, a total of 11 teams participated, with a total of 39 different systems submitted through all subtasks. Data, results and further information about the task can be found at https://competitions.codalab.org/competitions/17119.

Syntactically Aware Neural Architectures for Definition Extraction
Luis Espinosa-Anke | Steven Schockaert
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Automatically identifying definitional knowledge in text corpora (Definition Extraction or DE) is an important task with direct applications in, among others, Automatic Glossary Generation, Taxonomy Learning, Question Answering and Semantic Search. It is generally cast as a binary classification problem between definitional and non-definitional sentences. In this paper we present a set of neural architectures combining Convolutional and Recurrent Neural Networks, which are further enriched by incorporating linguistic information via syntactic dependencies. Our experimental results in the task of sentence classification, on two benchmarking DE datasets (one generic, one domain-specific), show that these models obtain consistent state-of-the-art results. Furthermore, we demonstrate that models trained on clean Wikipedia-like definitions can successfully be applied to noisier domain-specific corpora.

The interplay between lexical resources and Natural Language Processing
Jose Camacho-Collados | Luis Espinosa Anke | Mohammad Taher Pilehvar
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Incorporating linguistic, world and common sense knowledge into AI/NLP systems is currently an important research area, with several open problems and challenges. At the same time, processing and storing this knowledge in lexical resources is not a straightforward task. We propose to address these complementary goals from two methodological perspectives: the use of NLP methods to help the process of constructing and enriching lexical resources and the use of lexical resources for improving NLP applications. This tutorial may be useful for two main types of audience: on the one hand, those working on language resources who are interested in becoming acquainted with automatic NLP techniques, with the end goal of speeding up and/or easing the process of resource curation; and on the other hand, researchers in NLP who would like to benefit from the knowledge of lexical resources to improve their systems and models.

Improving Cross-Lingual Word Embeddings by Meeting in the Middle
Yerai Doval | Jose Camacho-Collados | Luis Espinosa-Anke | Steven Schockaert
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Cross-lingual word embeddings are becoming increasingly important in multilingual NLP. Recently, it has been shown that these embeddings can be effectively learned by aligning two disjoint monolingual vector spaces through linear transformations, using no more than a small bilingual dictionary as supervision. In this work, we propose to apply an additional transformation after the initial alignment step, which moves cross-lingual synonyms towards a middle point between them. By applying this transformation our aim is to obtain a better cross-lingual integration of the vector spaces. In addition, and perhaps surprisingly, the monolingual spaces also improve by this transformation. This is in contrast to the original alignment, which is typically learned such that the structure of the monolingual spaces is preserved. Our experiments confirm that the resulting cross-lingual embeddings outperform state-of-the-art models in both monolingual and cross-lingual evaluation tasks.
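
The "middle point" idea in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the two spaces have already been aligned, takes the element-wise average of each translation pair as the target, and fits one linear map per language with ordinary least squares. The function name fit_midpoint_maps and the synthetic data are hypothetical.

```python
import numpy as np


def fit_midpoint_maps(X, Y):
    """X, Y: (n, d) arrays holding already-aligned vectors of n translation pairs.

    Returns two (d, d) matrices mapping each space towards the pair midpoints.
    Least-squares fitting is an assumption made for this sketch.
    """
    M = (X + Y) / 2.0  # midpoint of every translation pair
    W_x, *_ = np.linalg.lstsq(X, M, rcond=None)
    W_y, *_ = np.linalg.lstsq(Y, M, rcond=None)
    return W_x, W_y


# Synthetic stand-in for an aligned bilingual dictionary (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = X + 0.1 * rng.normal(size=(50, 4))

W_x, W_y = fit_midpoint_maps(X, Y)
gap_before = np.linalg.norm(X - Y)
gap_after = np.linalg.norm(X @ W_x - Y @ W_y)
```

On the dictionary pairs used for fitting, the overall gap between the two mapped spaces cannot exceed the original gap, since the identity map is always a candidate solution of each least-squares problem. How well the maps generalize to words outside the dictionary is an empirical question that this sketch does not address.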

Interpretable Emoji Prediction via Label-Wise Attention LSTMs
Francesco Barbieri | Luis Espinosa-Anke | Jose Camacho-Collados | Steven Schockaert | Horacio Saggion
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Human language has evolved towards newer forms of communication such as social media, where emojis (i.e., ideograms bearing a visual meaning) play a key role. While there is an increasing body of work aimed at the computational modeling of emoji semantics, there is currently little understanding about what makes a computational model represent or predict a given emoji in a certain way. In this paper we propose a label-wise attention mechanism with which we attempt to better understand the nuances underlying emoji prediction. In addition to advantages in terms of interpretability, we show that our proposed architecture improves over standard baselines in emoji prediction, and does particularly well when predicting infrequent emojis.

SeVeN: Augmenting Word Embeddings with Unsupervised Relation Vectors
Luis Espinosa-Anke | Steven Schockaert
Proceedings of the 27th International Conference on Computational Linguistics

We present SeVeN (Semantic Vector Networks), a hybrid resource that encodes relationships between words in the form of a graph. Different from traditional semantic networks, these relations are represented as vectors in a continuous vector space. We propose a simple pipeline for learning such relation vectors, which is based on word vector averaging in combination with an ad hoc autoencoder. We show that by explicitly encoding relational information in a dedicated vector space we can capture aspects of word meaning that are complementary to what is captured by word embeddings. For example, by examining clusters of relation vectors, we observe that relational similarities can be identified at a more abstract level than with traditional word vector differences. Finally, we test the effectiveness of semantic vector networks in two tasks: measuring word similarity and neural text categorization. SeVeN is available at bitbucket.org/luisespinosa/seven.

Proceedings of the Third Workshop on Semantic Deep Learning
Luis Espinosa Anke | Dagmar Gromann | Thierry Declerck
Proceedings of the Third Workshop on Semantic Deep Learning

2017

Towards the Understanding of Gaming Audiences by Modeling Twitch Emotes
Francesco Barbieri | Luis Espinosa-Anke | Miguel Ballesteros | Juan Soler-Company | Horacio Saggion
Proceedings of the 3rd Workshop on Noisy User-generated Text

Videogame streaming platforms have become a paramount example of noisy user-generated text. These are websites where gaming is broadcast and which allow interaction with viewers via integrated chatrooms. Probably the best known platform of this kind is Twitch, which has more than 100 million monthly viewers. Despite these numbers, and unlike other platforms featuring short messages (e.g. Twitter), Twitch has not received much attention from the Natural Language Processing community. In this paper we aim at bridging this gap by proposing two important tasks specific to the Twitch platform, namely (1) Emote prediction; and (2) Trolling detection. In our experiments, we evaluate three models: a BOW baseline, a supervised logistic classifier based on word embeddings, and a bidirectional long short-term memory recurrent neural network (LSTM). Our results show that the LSTM model outperforms the other two models, in which explicit features with proven effectiveness for similar tasks were encoded.

2016

Example-based Acquisition of Fine-grained Collocation Resources
Sara Rodríguez-Fernández | Roberto Carlini | Luis Espinosa Anke | Leo Wanner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Collocations such as “heavy rain” or “make [a] decision” are combinations of two elements where one (the base) is freely chosen, while the choice of the other (collocate) is restricted, depending on the base. Collocations present difficulties even to advanced language learners, who usually struggle to find the right collocate to express a particular meaning, e.g., both “heavy” and “strong” express the meaning ‘intense’, but while “rain” selects “heavy”, “wind” selects “strong”. Lexical Functions (LFs) describe the meanings that hold between the elements of collocations, such as ‘intense’, ‘perform’, ‘create’, ‘increase’, etc. Language resources with semantically classified collocations would be of great help for students; however, they are expensive to build, since they are manually constructed, and scarce. We present an unsupervised approach to the acquisition and semantic classification of collocations according to LFs, based on word embeddings, in which, given an example of a collocation for each of the target LFs and a set of bases, the system retrieves a list of collocates for each base and LF.

ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain
Sergio Oramas | Luis Espinosa Anke | Mohamed Sordo | Horacio Saggion | Xavier Serra
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a gold standard dataset for Entity Linking (EL) in the Music Domain. It contains thousands of musical named entities such as Artist, Song or Record Label, which have been automatically annotated on a set of artist biographies coming from the Music website and social network Last.fm. The annotation process relies on the analysis of the hyperlinks present in the source texts and on a voting-based algorithm for EL, which considers, for each entity mention in text, the degree of agreement across three state-of-the-art EL systems. Manual evaluation shows that EL Precision is at least 94%, and due to its tunable nature, it is possible to derive annotations favouring higher Precision or Recall, at will. We make available the annotated dataset along with evaluation data and the code.
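
The agreement-based voting described in the abstract above can be sketched as follows. This is a hedged illustration rather than the released code: the function name vote, the min_agreement parameter, and the entity identifiers are hypothetical, and the actual algorithm may weigh systems or hyperlink evidence differently.

```python
from collections import Counter


def vote(candidates, min_agreement=2):
    """Pick an entity for one mention from the outputs of several EL systems.

    candidates: one entity ID per system, or None where a system abstained.
    min_agreement: how many systems must agree before the link is accepted
    (a hypothetical knob standing in for the tunable behaviour).
    """
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None
    entity, n_votes = counts.most_common(1)[0]
    return entity if n_votes >= min_agreement else None


# Three hypothetical systems annotating the same mention.
majority = vote(["artist:Nirvana", "artist:Nirvana", "concept:Nirvana"])
strict = vote(["artist:Nirvana", "artist:Nirvana", "concept:Nirvana"], min_agreement=3)
```

Raising min_agreement keeps only links that more systems agree on, which would be expected to favour Precision, while lowering it accepts more links and would be expected to favour Recall.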

Supervised Distributional Hypernym Discovery via Domain Adaptation
Luis Espinosa-Anke | Jose Camacho-Collados | Claudio Delli Bovi | Horacio Saggion
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Extending WordNet with Fine-Grained Collocational Information via Supervised Distributional Learning
Luis Espinosa-Anke | Jose Camacho-Collados | Sara Rodríguez-Fernández | Horacio Saggion | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

WordNet is probably the best known lexical resource in Natural Language Processing. While it is widely regarded as a high-quality repository of concepts and semantic relations, updating and extending it manually is costly. One important type of information which could potentially add enormous value to WordNet is collocational information, which is paramount in tasks such as Machine Translation, Natural Language Generation and Second Language Learning. In this paper, we present ColWordNet (CWN), an extended WordNet version with fine-grained collocational information, automatically introduced thanks to a method exploiting linear relations between analogous sense-level embedding spaces. We perform both intrinsic and extrinsic evaluations, and release CWN for the use and scrutiny of the community.

Semantics-Driven Recognition of Collocations Using Word Embeddings
Sara Rodríguez-Fernández | Luis Espinosa-Anke | Roberto Carlini | Leo Wanner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

TALN at SemEval-2016 Task 11: Modelling Complex Words by Contextual, Lexical and Semantic Features
Francesco Ronzano | Ahmed Abura’ed | Luis Espinosa-Anke | Horacio Saggion
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

TALN at SemEval-2016 Task 14: Semantic Taxonomy Enrichment Via Sense-Based Embeddings
Luis Espinosa-Anke | Francesco Ronzano | Horacio Saggion
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Knowledge Base Unification via Sense Embeddings and Disambiguation
Claudio Delli Bovi | Luis Espinosa-Anke | Roberto Navigli
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Weakly Supervised Definition Extraction
Luis Espinosa-Anke | Horacio Saggion | Francesco Ronzano
Proceedings of the International Conference Recent Advances in Natural Language Processing

TALN-UPF: Taxonomy Learning Exploiting CRF-Based Hypernym Extraction on Encyclopedic Definitions
Luis Espinosa-Anke | Horacio Saggion | Francesco Ronzano
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2013

Towards Definition Extraction Using Conditional Random Fields
Luis Espinosa Anke
Proceedings of the Student Research Workshop associated with RANLP 2013