Magnus Sahlgren


2020

Text Categorization for Conflict Event Annotation
Fredrik Olsson | Magnus Sahlgren | Fehmi ben Abdesslem | Ariel Ekgren | Kristine Eck
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020

We cast the problem of event annotation as one of text categorization, and compare state-of-the-art text categorization techniques on event data produced within the Uppsala Conflict Data Program (UCDP). Annotating a single text involves assigning the labels pertaining to at least 17 distinct categorization tasks, e.g., who the attacking organization was, who was attacked, and where the event took place. The text categorization techniques under scrutiny are a classical Bag-of-Words approach; character-based contextualized embeddings produced by ELMo; embeddings produced by the BERT base model, and a version of BERT base fine-tuned on UCDP data; and a pre-trained and fine-tuned classifier based on ULMFiT. The categorization tasks are very diverse in terms of the number of classes to predict as well as the skewness of the distribution of classes. The categorization results exhibit a large variability across tasks, ranging from 30.3% to 99.8% F-score.

SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
Amaru Cuba Gyllensten | Evangelia Gogoulou | Ariel Ekgren | Magnus Sahlgren
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The basic idea is that contextualized embeddings that encode the same sense are located in close proximity in the embedding space. Our approach is both simple and generic, yet performs relatively well in both sub-tasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.

Rethinking Topic Modelling: From Document-Space to Term-Space
Magnus Sahlgren
Findings of the Association for Computational Linguistics: EMNLP 2020

This paper problematizes the reliance on documents as the basic notion for defining term interactions in standard topic models. As an alternative to this practice, we reformulate topic distributions as latent factors in term similarity space. We exemplify the idea using a number of standard word embeddings built with very wide context windows. The embedding spaces are transformed to sparse similarity spaces, and topics are extracted in standard fashion by factorizing to a lower-dimensional space. We use a number of different factorization techniques, and evaluate the various models using a large set of evaluation metrics, including previously published coherence measures, as well as a number of novel measures that we suggest better correspond to real-world applications of topic models. Our results clearly demonstrate that term-based models outperform standard document-based models by a large margin.

2019

R-grams: Unsupervised Learning of Semantic Units in Natural Language
Amaru Cuba Gyllensten | Ariel Ekgren | Magnus Sahlgren
Proceedings of the 13th International Conference on Computational Semantics - Student Papers

This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques. In contrast to previous work, which has primarily focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.

Gender Bias in Pretrained Swedish Embeddings
Magnus Sahlgren | Fredrik Olsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper investigates the presence of gender bias in pretrained Swedish embeddings. We focus on a scenario where names are matched with occupations, and we demonstrate how a number of standard pretrained embeddings handle this task. Our experiments show some significant differences between the pretrained embeddings, with word-based methods showing the most bias and contextualized language models showing the least. We also demonstrate that the previously proposed debiasing method does not affect the performance of the various embeddings in this scenario.

2018

Distributional Term Set Expansion
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Learning Representations for Detecting Abusive Language
Magnus Sahlgren | Tim Isbister | Fredrik Olsson
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

This paper discusses the question of whether it is possible to learn a generic representation that is useful for detecting various types of abusive language. The approach is inspired by recent advances in transfer learning and word embeddings, and we learn representations from two different datasets containing various degrees of abusive language. We compare the learned representation with two standard approaches: one based on lexica, and one based on data-specific n-grams. Our experiments show that learned representations do contain useful information that can be used to improve detection performance when training data is limited.

Measuring Issue Ownership using Word Embeddings
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as “what is being talked about, regarding X” and “what do people feel, regarding X”. In this paper, we investigate another avenue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to explain voter choice and electoral outcomes. We argue that issue alignment and agenda setting can be seen as a kind of semantic source similarity of the kind “how similar is source A to issue owner P, when talking about issue X”, and as such can be measured using word/document embedding techniques. We present work in progress towards measuring that kind of conditioned similarity, and introduce a new notion of similarity for predictive embeddings. We then test this method by measuring the similarity between politically aligned media and political parties, conditioned on bloc-specific issues.

2016

The Gavagai Living Lexicon
Magnus Sahlgren | Amaru Cuba Gyllensten | Fredrik Espinoza | Ola Hamfors | Jussi Karlgren | Fredrik Olsson | Per Persson | Akshay Viswanathan | Anders Holst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

The Effects of Data Size and Frequency Range on Distributional Semantic Models
Magnus Sahlgren | Alessandro Lenci
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Parameterized context windows in Random Indexing
Tobias Norlund | David Nilsson | Magnus Sahlgren
Proceedings of the 1st Workshop on Representation Learning for NLP

Unshared task: (Dis)agreement in online debates
Maria Skeppstedt | Magnus Sahlgren | Carita Paradis | Andreas Kerren
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

Active learning for detection of stance components
Maria Skeppstedt | Magnus Sahlgren | Carita Paradis | Andreas Kerren
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

Automatic detection of five language components, which are all relevant for expressing opinions and for stance taking, was studied: positive sentiment, negative sentiment, speculation, contrast and condition. A resource-aware approach was taken, which included manual annotation of 500 training samples and the use of limited lexical resources. Active learning was compared to random selection of training data, as well as to a lexicon-based method. Active learning was successful for the categories speculation, contrast and condition, but not for the two sentiment categories, for which results achieved when using active learning were similar to those achieved when applying a random selection of training data. This difference is likely due to a larger variation in how sentiment is expressed than in how speakers express the other three categories. This larger variation was also shown by the lower recall results achieved by the lexicon-based approach for sentiment than for the categories speculation, contrast and condition.

2015

Factorization of Latent Variables in Distributional Semantic Models
Arvid Österlund | David Ödling | Magnus Sahlgren
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Navigating the Semantic Horizon using Relative Neighborhood Graphs
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Detecting speculations, contrasts and conditionals in consumer reviews
Maria Skeppstedt | Teri Schamp-Bjerede | Magnus Sahlgren | Carita Paradis | Andreas Kerren
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2010

Proceedings of the NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics
Magnus Sahlgren | Ola Knutsson
Proceedings of the NAACL HLT Workshop on Extracting and Using Constructions in Computational Linguistics

2007

SICS: Valence annotation based on seeds in word space
Magnus Sahlgren | Jussi Karlgren | Gunnar Eriksson
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Towards pertinent evaluation methodologies for word-space models
Magnus Sahlgren
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper discusses evaluation methodologies for a particular kind of meaning model known as word-space models, which use distributional information to assemble geometric representations of meaning similarities. Word-space models have received considerable attention in recent years, and have begun to see employment outside the walls of computational linguistics laboratories. However, the evaluation methodologies of such models remain infantile, and lack efforts at standardization. Very few studies have critically assessed the methodologies used to evaluate word spaces. This paper attempts to fill some of this void. It is the central goal of this paper to answer the question “how can we determine whether a given word space is a good word space?”

Creating bilingual lexica using reference wordlists for alignment of monolingual semantic vector spaces
Jon Holmlund | Magnus Sahlgren | Jussi Karlgren
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

2004

Using Bag-of-Concepts to Improve the Performance of Support Vector Machines in Text Categorization
Magnus Sahlgren | Rickard Cöster
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Automatic Bilingual Lexicon Acquisition Using Random Indexing of Aligned Bilingual Data
Magnus Sahlgren
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2001

Using Linguistic Information to Improve the Performance of Vector-Based Semantic Analysis
Magnus Sahlgren | David Swanberg
Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001)