Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.
In this paper, we present our system for the SemEval 2020 task on code-mixed sentiment analysis. Our system makes use of large transformer based multilingual embeddings like mBERT. Recent work has shown that these models posses the ability to solve code-mixed tasks in addition to their originally demonstrated cross-lingual abilities. We evaluate the stock versions of these models for the sentiment analysis task and also show that their performance can be improved by using unlabelled code-mixed data. Our submission (username Genius1237) achieved the second rank on the English-Hindi subtask with an F1 score of 0.726.
In this paper, we explore the methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees. Existing work has shown that linguistic theories can be used to generate code-mixed sentences from a set of parallel sentences. We build upon this work, using one of these theories, the Equivalence-Constraint theory to obtain the parse trees of synthetically generated code-mixed sentences and evaluate them with a neural constituency parser. We highlight the lack of a dataset non-synthetic code-mixed constituency parse trees and how it makes our evaluation difficult. To complete our evaluation, we convert a code-mixed dependency parse tree set into “pseudo constituency trees” and find that a parser trained on synthetically generated trees is able to decently parse these as well.
Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs.Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user generated content from some geographies, like South Asia, parts of Europe, and Singapore, are code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies.In the past 5 years, there has been an active interest in computational models for code-mixing with a substantive research outcome in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial is expecting to fill this gap and provide new researchers in the area with a foundation in both linguistic and computational aspects of code-mixing. We hope that this then becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.