ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification

Kathrein Abu Kwaik, Motaz Saad


Abstract
In this paper, we present a Dialect Identification system (ArbDialectID) that competed at Task 1 of the MADAR shared task, MADARTravel Domain Dialect Identification. We build a course and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We firstly build a coarse identification model to classify each sentence into one out of six dialects, then use this label as a feature for the fine-grained model that classifies the sentence among 26 dialects from different Arab cities, after that we apply ensemble voting classifier on both sub-systems. Our system ranked 1st that achieving an f-score of 67.32%. Both the models and our feature engineering tools are made available to the research community.
Anthology ID:
W19-4632
Volume:
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Month:
August
Year:
2019
Address:
Florence, Italy
Venues:
ACL | WANLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
254–258
Language:
URL:
https://www.aclweb.org/anthology/W19-4632
DOI:
10.18653/v1/W19-4632
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/W19-4632.pdf