Mladen Karan


2020

Classification-Based Self-Learning for Weakly Supervised Bilingual Lexicon Induction
Mladen Karan | Ivan Vulić | Anna Korhonen | Goran Glavaš
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Effective projection-based cross-lingual word embedding (CLWE) induction critically relies on the iterative self-learning procedure. It gradually expands the initial small seed dictionary to learn improved cross-lingual mappings. In this work, we present ClassyMap, a classification-based approach to self-learning, yielding a more robust and a more effective induction of projection-based CLWEs. Unlike prior self-learning methods, our approach allows for integration of diverse features into the iterative process. We show the benefits of ClassyMap for bilingual lexicon induction: we report consistent improvements in a weakly supervised setup (500 seed translation pairs) on a benchmark with 28 language pairs.

XHate-999: Analyzing and Detecting Abusive Language Across Domains and Languages
Goran Glavaš | Mladen Karan | Ivan Vulić
Proceedings of the 28th International Conference on Computational Linguistics

We present XHate-999, a multi-domain and multilingual evaluation data set for abusive language detection. By aligning test instances across six typologically diverse languages, XHate-999 for the first time allows for disentanglement of the domain transfer and language transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHate-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain- and language-adaption, via intermediate masked language modeling on abusive corpora in the target language, can lead to substantially improved abusive language detection in the target language in the zero-shot transfer setups.

2019

Preemptive Toxic Language Detection in Wikipedia Comments Using Thread-Level Context
Mladen Karan | Jan Šnajder
Proceedings of the Third Workshop on Abusive Language Online

We address the task of automatically detecting toxic content in user-generated texts. We focus on exploring the potential for preemptive moderation, i.e., predicting whether a particular conversation thread will, in the future, incite a toxic comment. Moreover, we perform a preliminary investigation of whether a model that jointly considers all comments in a conversation thread outperforms a model that considers only individual comments. Using an existing dataset of conversations among Wikipedia contributors as a starting point, we compile a new large-scale dataset for this task consisting of labeled comments and comments from their conversation threads.

Data Set for Stance and Sentiment Analysis from User Comments on Croatian News
Mihaela Bošnjak | Mladen Karan
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Nowadays it is becoming more important than ever to find new ways of extracting useful information from the ever-growing amount of user-generated data available online. In this paper, we describe the creation of a data set that contains news articles and corresponding comments from the Croatian news outlet 24 sata. Our annotation scheme is specifically tailored for the task of detecting stances and sentiment from user comments as well as assessing if commentator claims are verifiable. Through this data, we hope to get a better understanding of the public's viewpoint on various events. In addition, we also explore the potential of applying supervised machine learning models to automate annotation of more data.

2018

Combining Shallow and Deep Learning for Aggressive Text Detection
Viktor Golem | Mladen Karan | Jan Šnajder
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

We describe the participation of team TakeLab in the aggression detection shared task at the TRAC1 workshop for English. Aggression manifests in a variety of ways. Unlike some forms of aggression that are impossible to prevent in day-to-day life, aggressive speech abounding on social networks could in principle be prevented or at least reduced by simply disabling users that post aggressively worded messages. The first step in achieving this is to detect such messages. The task, however, is far from being trivial, as what is considered as aggressive speech can be quite subjective, and the task is further complicated by the noisy nature of user-generated text on social networks. Our system learns to distinguish between open aggression, covert aggression, and non-aggression in social media texts. We tried different machine learning approaches, including traditional (shallow) machine learning models, deep learning models, and a combination of both. We achieved respectable results, ranking 4th and 8th out of 31 submissions on the Facebook and Twitter test sets, respectively.

Cross-Domain Detection of Abusive Language Online
Mladen Karan | Jan Šnajder
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

We investigate to what extent the models trained to detect general abusive language generalize between different datasets labeled with different abusive language types. To this end, we compare the cross-domain performance of simple classification models on nine different datasets, finding that the models fail to generalize to out-of-domain datasets and that having at least some in-domain data is important. We also show that using the frustratingly simple domain adaptation (Daumé III, 2007) in most cases improves the results over in-domain training, especially when used to augment a smaller dataset with a larger one.

2017

TakeLab-QA at SemEval-2017 Task 3: Classification Experiments for Answer Retrieval in Community QA
Filip Šaina | Toni Kukurin | Lukrecija Puljić | Mladen Karan | Jan Šnajder
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper, we present the TakeLab-QA entry to SemEval 2017 task 3, which is a question-comment re-ranking problem. We present a classification-based approach, including two supervised learning models – Support Vector Machines (SVM) and Convolutional Neural Networks (CNN). We use features based on different semantic similarity models (e.g., Latent Dirichlet Allocation), as well as features based on several types of pre-trained word embeddings. Moreover, we also use some hand-crafted task-specific features. For training, our system uses no external labeled data apart from that provided by the organizers. Our primary submission achieves a MAP score of 81.14 and an F1 score of 66.99 – ranking us 10th on the SemEval 2017 task 3, subtask A.

2016

Analysis of Policy Agendas: Lessons Learned from Automatic Topic Classification of Croatian Political Texts
Mladen Karan | Jan Šnajder | Daniela Širinić | Goran Glavaš
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

TakeLab at SemEval-2016 Task 6: Stance Classification in Tweets Using a Genetic Algorithm Based Ensemble
Martin Tutek | Ivan Sekulić | Paula Gombar | Ivan Paljak | Filip Čulinović | Filip Boltužić | Mladen Karan | Domagoj Alagić | Jan Šnajder
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

TKLBLIIR: Detecting Twitter Paraphrases with TweetingJay
Mladen Karan | Goran Glavaš | Jan Šnajder | Bojana Dalbelo Bašić | Ivan Vulić | Marie-Francine Moens
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2013

Frequently Asked Questions Retrieval for Croatian Based on Semantic Textual Similarity
Mladen Karan | Lovro Žmak | Jan Šnajder
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

2012

TakeLab: Systems for Measuring Semantic Text Similarity
Frane Šarić | Goran Glavaš | Mladen Karan | Jan Šnajder | Bojana Dalbelo Bašić
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Evaluation of Classification Algorithms and Features for Collocation Extraction in Croatian
Mladen Karan | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Collocations can be defined as words that occur together significantly more often than would be expected by chance. Many natural language processing applications such as natural language generation, word sense disambiguation and machine translation can benefit from having access to information about collocated words. We approach collocation extraction as a classification problem where the task is to classify a given n-gram as either a collocation (positive) or a non-collocation (negative). Among the features used are word frequencies, classical association measures (Dice, PMI, χ²), and POS tags. In addition, semantic word relatedness modeled by latent semantic analysis is also included. We apply wrapper feature subset selection to determine the best set of features. Performance of various classification algorithms is tested. Experiments are conducted on a manually annotated set of bigrams and trigrams sampled from a Croatian newspaper corpus. Best results obtained are 79.8 F1 measure for bigrams and 67.5 F1 measure for trigrams. The best classifier for bigrams was SVM, while for trigrams the decision tree gave the best performance. Features which contributed the most to overall performance were PMI, semantic relatedness, and POS information.