Kenneth Steimel


2020

pdf bib
Fine-Grained Morpho-Syntactic Analysis for the Under-Resourced Language Chaghatay
Kenneth Steimel | Akbar Amat | Arienne Dwyer | Sandra Kübler
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

pdf bib
Investigating Multilingual Abusive Language Detection: A Cautionary Tale
Kenneth Steimel | Daniel Dakota | Yue Chen | Sandra Kübler
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Abusive language detection has received much attention in the last years, and recent approaches perform the task in a number of different languages. We investigate which factors have an effect on multilingual settings, focusing on the compatibility of data and annotations. In the current paper, we focus on English and German. Our findings show large differences in performance between the two languages. We find that the best performance is achieved by different classification algorithms. Sampling to address class imbalance issues is detrimental for German and beneficial for English. The only similarity that we find is that neither data set shows clear topics when we compare the results of topic modeling to the gold standard. Based on our findings, we can conclude that a multilingual optimization of classifiers is not possible even in settings where comparable data sets are used.

2018

pdf bib
Part of Speech Tagging in Luyia: A Bantu Macrolanguage
Kenneth Steimel
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

Luyia is a macrolanguage in central Kenya. The Luyia languages, like other Bantu languages, have a complex morphological system. This system can be leveraged to aid in part of speech tagging. Bag-of-characters taggers trained on a source Luyia language can be applied directly to another Luyia language with some degree of success. In addition, mixing data from the target language with data from the source language does produce more accurate predictive models compared to models trained on just the target language data when the training set size is small. However, for both of these tagging tasks, models involving the more distantly related language, Tiriki, are better at predicting part of speech tags for Wanga data. The models incorporating Bukusu data are not as successful despite the closer relationship between Bukusu and Wanga. Overlapping vocabulary between the Wanga and Tiriki corpora as well as a bias towards open class words help Tiriki outperform Bukusu.

2016

pdf bib
IUCL at SemEval-2016 Task 6: An Ensemble Model for Stance Detection in Twitter
Can Liu | Wen Li | Bradford Demarest | Yue Chen | Sara Couture | Daniel Dakota | Nikita Haduong | Noah Kaufman | Andrew Lamont | Manan Pancholi | Kenneth Steimel | Sandra Kübler
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)