Cross-corpus Native Language Identification via Statistical Embedding

Francisco Rangel, Paolo Rosso, Julian Brooke, Alexandra Uitdenbogerd


Abstract
In this paper, we approach the task of native language identification in a realistic cross-corpus scenario, where a model is trained on the available data and must predict the native language from data in a different corpus. The motivation behind this study is to investigate native language identification in the Australian academic setting, where a majority of students come from China, Indonesia, and Arabic-speaking nations. We propose a statistical embedding representation that yields a significant improvement over common state-of-the-art single-layer approaches when identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach is shown to be competitive even when the data are scarce and imbalanced.
Anthology ID:
W18-1605
Volume:
Proceedings of the Second Workshop on Stylistic Variation
Month:
June
Year:
2018
Address:
New Orleans
Venues:
NAACL | Style-Var | WS
Publisher:
Association for Computational Linguistics
Pages:
39–43
URL:
https://www.aclweb.org/anthology/W18-1605
DOI:
10.18653/v1/W18-1605
PDF:
http://aclanthology.lst.uni-saarland.de/W18-1605.pdf