Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style

Ben Verhoeven, Iza Škrjanec, Senja Pollak


Abstract
To our knowledge, we present the results of the first gender classification experiments on Slovene text. Inspired by the TwiSty corpus and experiments (Verhoeven et al., 2016), we employed the Janes corpus (Erjavec et al., 2016) and its gender annotations to perform gender classification experiments on Twitter text, comparing a token-based and a lemma-based approach. We find that the token-based approach (92.6% accuracy), which retains gender markings related to the author, outperforms the lemma-based approach by about 5%. In the lemmatized version especially, we also observe stylistic and content-based differences between the writing of men (e.g. more profane language, numerals and beer mentions) and women (e.g. more pronouns, emoticons and character flooding). Many of our findings corroborate previous research on other languages.
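The difference between the two feature representations can be illustrated with a minimal sketch. It is not the paper's pipeline: the toy lemma table `TOY_LEMMAS`, the helper `bag_of_words`, and the two example sentences are all hypothetical, and a real system would use a proper Slovene lemmatizer and a trained classifier. The sketch only shows why lemmatization can hide author gender: Slovene past participles agree with the speaker's gender ("sem bila" for a woman, "sem bil" for a man), and both forms reduce to the lemma "biti".

```python
from collections import Counter

# Hypothetical toy lemma lookup, for illustration only.
# "sem", "bil" and "bila" are all forms of the verb "biti" (to be).
TOY_LEMMAS = {"sem": "biti", "bil": "biti", "bila": "biti"}


def bag_of_words(tweet, lemmatize=False):
    """Return word counts for a tweet, optionally mapping tokens to lemmas."""
    tokens = tweet.lower().split()
    if lemmatize:
        # Unknown tokens are kept unchanged in this toy setup.
        tokens = [TOY_LEMMAS.get(token, token) for token in tokens]
    return Counter(tokens)


# "Yesterday I was at home", written by a woman and by a man respectively.
female_tweet = "včeraj sem bila doma"
male_tweet = "včeraj sem bil doma"

# Token features keep the gender-marked participle, so the two differ.
tokens_differ = bag_of_words(female_tweet) != bag_of_words(male_tweet)

# Lemma features collapse "bila" and "bil" to "biti", so they coincide.
lemmas_match = (bag_of_words(female_tweet, lemmatize=True)
                == bag_of_words(male_tweet, lemmatize=True))

print(tokens_differ, lemmas_match)
```

With the gender-marked forms removed, a classifier trained on lemmas has to rely on content and style cues instead, which is the setting in which the abstract reports its stylistic and content-based observations.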
Anthology ID:
W17-1418
Volume:
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Month:
April
Year:
2017
Address:
Valencia, Spain
Venues:
BSNLP | WS
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
119–125
URL:
https://www.aclweb.org/anthology/W17-1418
DOI:
10.18653/v1/W17-1418
PDF:
http://aclanthology.lst.uni-saarland.de/W17-1418.pdf