Kathrein Abu Kwaik


An Arabic Tweets Sentiment Analysis Dataset (ATSAD) using Distant Supervision and Self Training
Kathrein Abu Kwaik | Stergios Chatzikyriakidis | Simon Dobnik | Motaz Saad | Richard Johansson
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

As the number of social media users increases, more people use these platforms to express their thoughts and needs, socialise, and publish their opinions and reviews. Good social media sentiment analysis requires good-quality resources, and the lack of such resources is particularly evident for languages other than English, in particular Arabic. The available Arabic resources fall short in either the size of the corpus or the quality of the annotation. In this paper, we present an Arabic Sentiment Analysis Corpus collected from Twitter, which contains 36K tweets labelled as positive or negative. We employed distant supervision and self-training approaches to annotate the corpus. In addition, we release 8K manually annotated tweets as a gold standard. We evaluated the corpus intrinsically by comparing it to human classification and pre-trained sentiment analysis models. Moreover, we applied extrinsic evaluation methods on a sentiment analysis task and achieved an accuracy of 86%.
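
The combination of distant supervision and self-training mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which cues served as distant labels, which classifier was used, or what confidence threshold was applied. Here the cue sets `POSITIVE_CUES` and `NEGATIVE_CUES`, the tiny Naive Bayes model, the `threshold` value, and the English toy tweets are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical cue sets acting as noisy ("distant") labels.
# The abstract does not list the cues actually used.
POSITIVE_CUES = {"😀", "😍", "👍"}
NEGATIVE_CUES = {"😢", "😡", "👎"}
CUES = POSITIVE_CUES | NEGATIVE_CUES

def distant_label(tweet):
    """Return 'pos' or 'neg', or None when cues are absent or conflicting."""
    has_pos = any(cue in tweet for cue in POSITIVE_CUES)
    has_neg = any(cue in tweet for cue in NEGATIVE_CUES)
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"

def tokens(tweet):
    # Drop the cues so the classifier cannot simply memorise them.
    return [tok for tok in tweet.split() if tok not in CUES]

def train_nb(labelled):
    """Collect per-class token counts and document counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    docs = Counter()
    for text, label in labelled:
        counts[label].update(tokens(text))
        docs[label] += 1
    return counts, docs

def predict_nb(model, text):
    """Return (best_label, probability) under add-one smoothed Naive Bayes."""
    counts, docs = model
    vocab = set(counts["pos"]) | set(counts["neg"])
    scores = {}
    for label in ("pos", "neg"):
        total = sum(counts[label].values()) + len(vocab) + 1
        score = math.log((docs[label] + 1) / (sum(docs.values()) + 2))
        for tok in tokens(text):
            score += math.log((counts[label][tok] + 1) / total)
        scores[label] = score
    best = max(scores, key=scores.get)
    norm = sum(math.exp(s - scores[best]) for s in scores.values())
    return best, 1.0 / norm

def self_train(seed, unlabelled, threshold=0.8, rounds=3):
    """Repeatedly add confidently predicted tweets to the training set."""
    labelled, pool = list(seed), list(unlabelled)
    for _ in range(rounds):
        model = train_nb(labelled)
        confident, rest = [], []
        for text in pool:
            label, prob = predict_nb(model, text)
            if prob >= threshold:
                confident.append((text, label))
            else:
                rest.append(text)
        if not confident:
            break
        labelled.extend(confident)
        pool = rest
    return labelled, pool

# Toy English tweets standing in for real data.
tweets = [
    "great day 😀", "awful service 😡",
    "lovely great food 👍", "awful terrible traffic 👎",
    "great lovely", "terrible awful", "mixed 😀 😡", "hello world",
]
seed = [(t, distant_label(t)) for t in tweets if distant_label(t)]
unlabelled = [t for t in tweets if distant_label(t) is None]
labelled, pool = self_train(seed, unlabelled)
```

In this toy run, tweets with an unambiguous cue seed the training set, and self-training then labels only the cue-free tweets that the model scores above the threshold. Tweets with conflicting cues or no informative tokens stay in the unlabelled pool.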


Proceedings of the 13th International Conference on Computational Semantics - Student Papers
Simon Dobnik | Stergios Chatzikyriakidis | Vera Demberg | Kathrein Abu Kwaik | Vladislav Maraev
Proceedings of the 13th International Conference on Computational Semantics - Student Papers

ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification
Kathrein Abu Kwaik | Motaz Saad
Proceedings of the Fourth Arabic Natural Language Processing Workshop

In this paper, we present a Dialect Identification system (ArbDialectID) that competed in Task 1 of the MADAR shared task, MADAR Travel Domain Dialect Identification. We build a coarse and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We first build a coarse identification model to classify each sentence into one of six dialects, then use this label as a feature for the fine-grained model, which classifies the sentence among 26 dialects from different Arab cities. After that, we apply an ensemble voting classifier on both sub-systems. Our system ranked 1st, achieving an F-score of 67.32%. Both the models and our feature engineering tools are made available to the research community.
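
The coarse-to-fine idea described above, where the predicted coarse label becomes a feature for the fine-grained model and two sub-systems are combined by voting, can be sketched roughly as follows. This is not the ArbDialectID implementation. It assumes that the two sub-systems are a word-level and a character-level pipeline, that voting means summing their log scores, and that a small Naive Bayes model stands in for the actual classifiers. The names `NaiveBayes`, `CoarseToFine`, and `vote`, the placeholder tokens, and the labels `R1`/`C1` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def word_feats(text):
    """Word-level features: whitespace tokens."""
    return text.split()

def char_feats(text, n=3):
    """Character-level features: n-grams over the space-padded text."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NaiveBayes:
    """Tiny add-one smoothed Naive Bayes used as a stand-in classifier."""

    def fit(self, feature_lists, labels):
        self.counts = defaultdict(Counter)
        self.docs = Counter(labels)
        for feats, label in zip(feature_lists, labels):
            self.counts[label].update(feats)
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def log_scores(self, feats):
        n_docs = sum(self.docs.values())
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values()) + len(self.vocab) + 1
            score = math.log(self.docs[label] / n_docs)
            score += sum(math.log((counter[f] + 1) / total) for f in feats)
            scores[label] = score
        return scores

    def predict(self, feats):
        scores = self.log_scores(feats)
        return max(scores, key=scores.get)

class CoarseToFine:
    """Coarse model first; its prediction is appended as a fine-model feature."""

    def __init__(self, featurizer):
        self.featurizer = featurizer

    def _augment(self, feats):
        return feats + ["COARSE=" + self.coarse_model.predict(feats)]

    def fit(self, texts, coarse_labels, fine_labels):
        feats = [self.featurizer(t) for t in texts]
        self.coarse_model = NaiveBayes().fit(feats, coarse_labels)
        augmented = [self._augment(f) for f in feats]
        self.fine_model = NaiveBayes().fit(augmented, fine_labels)
        return self

    def fine_scores(self, text):
        return self.fine_model.log_scores(self._augment(self.featurizer(text)))

def vote(systems, text):
    """Soft voting (an assumption): sum log scores across sub-systems."""
    combined = defaultdict(float)
    for system in systems:
        for label, score in system.fine_scores(text).items():
            combined[label] += score
    return max(combined, key=combined.get)

# Placeholder tokens and labels, not real dialect data.
texts = ["aa bb cc", "aa bb dd", "ee ff gg", "ee ff hh"]
coarse = ["R1", "R1", "R2", "R2"]   # hypothetical coarse dialect groups
fine = ["C1", "C2", "C3", "C4"]     # hypothetical city-level dialects

systems = [
    CoarseToFine(word_feats).fit(texts, coarse, fine),
    CoarseToFine(char_feats).fit(texts, coarse, fine),
]
prediction = vote(systems, "aa bb cc")
```

Feeding the coarse prediction in as a feature lets the fine-grained model narrow its choice to labels that co-occur with that coarse group, while the word and character views can compensate for each other's sparsity.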


Shami: A Corpus of Levantine Arabic Dialects
Kathrein Abu Kwaik | Motaz Saad | Stergios Chatzikyriakidis | Simon Dobnik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)