Simplified guidelines for the creation of Large Scale Dialectal Arabic Annotations

Heba Elfardy, Mona Diab


Abstract
The Arabic language is a collection of dialectal variants along with the standard form, Modern Standard Arabic (MSA). MSA is used in official Settings while the dialectal variants (DA) correspond to the native tongue of the Arabic speakers. Arabic speakers typically code switch between DA and MSA, which is reflected extensively in written online social media. Automatic processing such Arabic genre is very difficult for automated NLP tools since the linguistic difference between MSA and DA is quite profound. However, no annotated resources exist for marking the regions of such switches in the utterance. In this paper, we present a simplified Set of guidelines for detecting code switching in Arabic on the word/token level. We use these guidelines in annotating a corpus that is rich in DA with frequent code switching to MSA. We present both a quantitative and qualitative analysis of the annotations.
Anthology ID:
L12-1482
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
371–378
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/815_Paper.pdf
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/815_Paper.pdf