Kalika Bali


2020

The State and Fate of Linguistic Diversity and Inclusion in the NLP World
Pratik Joshi | Sebastin Santy | Amar Budhiraja | Kalika Bali | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.

Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers
Basil Abraham | Danish Goel | Divya Siddarth | Kalika Bali | Manu Chopra | Monojit Choudhury | Pratik Joshi | Preethi Jyoti | Sunayana Sitaram | Vivek Seshadri
Proceedings of the 12th Language Resources and Evaluation Conference

Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology, whose dialects are often very different from those of low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to providing diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study where we collected labelled speech data in the Marathi language from three different user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work) and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.

Learnings from Technological Interventions in a Low Resource Language: A Case-Study on Gondi
Devansh Mehta | Sebastin Santy | Ramaravind Kommiya Mothilal | Brij Mohan Lal Srivastava | Alok Sharma | Anurag Shukla | Vishnu Prasad | Venkanna U | Amit Sharma | Kalika Bali
Proceedings of the 12th Language Resources and Evaluation Conference

The primary obstacle to developing technologies for low-resource languages is the lack of usable data. In this paper, we report the adaptation and deployment of four technology-driven methods of data collection for Gondi, a low-resource vulnerable language spoken by around 2.3 million tribal people in south and central India. In the process of data collection, we also help in its revival by expanding access to information in Gondi through the creation of linguistic resources that can be used by the community, such as a dictionary, children’s stories, an app with Gondi content from multiple sources, and an Interactive Voice Response (IVR) based mass awareness platform. At the end of these interventions, we collected a little less than 12,000 translated words and/or sentences and identified more than 650 community members whose help can be solicited for future translation efforts. The larger goal of the project is to collect enough data in Gondi to build and deploy viable language technologies, like machine translation and speech-to-text systems, that can help take the language onto the internet.

Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

Understanding Script-Mixing: A Case Study of Hindi-English Bilingual Twitter Users
Abhishek Srivastava | Kalika Bali | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

In a multi-lingual and multi-script society such as India, many users resort to code-mixing while typing on social media. While code-mixing has received a lot of attention in the past few years, it has mostly been studied within a single-script scenario. In this work, we present a case study of Hindi-English bilingual Twitter users while considering the nuances that come with the intermixing of different scripts. We present a concise analysis of how scripts and languages interact in communities and cultures where code-mixing is rampant and offer certain insights into the findings. Our analysis shows that both intra-sentential and inter-sentential script-mixing are present on Twitter and show different behavior in different contexts. Examples suggest that script can be employed as a tool for emphasizing certain phrases within a sentence or disambiguating the meaning of a word. Script choice can also be an indicator of whether a word is borrowed or not. We present our analysis along with examples that bring out the nuances of the different cases.

Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation
Girish Nath Jha | Kalika Bali | Sobha L. | S. S. Agrawal | Atul Kr. Ojha
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation

2019

INMT: Interactive Neural Machine Translation Prediction
Sebastin Santy | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

In this paper, we demonstrate an Interactive Machine Translation interface that assists human translators with on-the-fly hints and suggestions. This makes the end-to-end translation process faster and more efficient, and it yields high-quality translations. We augment the OpenNMT backend with a mechanism to accept the user input and generate conditioned translations.

2018

An Integrated Representation of Linguistic and Social Functions of Code-Switching
Silvana Hartmann | Monojit Choudhury | Kalika Bali
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Discovering Canonical Indian English Accents: A Crowdsourcing-based Approach
Sunayana Sitaram | Varun Manjunath | Varun Bharadwaj | Monojit Choudhury | Kalika Bali | Michael Tjalve
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
Adithya Pratapa | Gayatri Bhat | Monojit Choudhury | Sunayana Sitaram | Sandipan Dandapat | Kalika Bali
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training language models for Code-mixed (CM) language is known to be a difficult problem because of the lack of data, compounded by the increased confusability due to the presence of more than one language. We present a computational technique for the creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory. We show that when training examples are sampled appropriately from this synthetic data and presented in a certain order (aka training curriculum) along with monolingual and real CM data, it can significantly reduce the perplexity of an RNN-based language model. We also show that randomly generated CM data does not help in decreasing the perplexity of the LMs.

Phone Merging For Code-Switched Speech Recognition
Sunit Sivasankaran | Brij Mohan Lal Srivastava | Sunayana Sitaram | Kalika Bali | Monojit Choudhury
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Speakers in multilingual communities often switch between or mix multiple languages in the same conversation. Automatic Speech Recognition (ASR) of code-switched speech faces many challenges, including the influence of phones of different languages on each other. This paper shows evidence that phone sharing between languages improves the Acoustic Model performance for Hindi-English code-switched speech. We compare a baseline system built with separate phones for Hindi and English with systems where the phones were manually merged based on linguistic knowledge. Encouraged by the improved ASR performance after manually merging the phones, we further investigate multiple data-driven methods to identify phones to be merged across the languages. We show a detailed analysis of automatic phone merging in this language pair and the impact it has on individual phone accuracies and WER. Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.

Accommodation of Conversational Code-Choice
Anshul Bawa | Monojit Choudhury | Kalika Bali
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Bilingual speakers often freely mix languages. However, in such bilingual conversations, are the language choices of the speakers coordinated? How much does one speaker’s choice of language affect other speakers? In this paper, we formulate code-choice as a linguistic style, and show that speakers are indeed sensitive to and accommodating of each other’s code-choice. We find that the saliency or markedness of a language in context directly affects the degree of accommodation observed. More importantly, we discover that accommodation of code-choices persists over several conversational turns. We also propose an alternative interpretation of conversational accommodation as a retrieval problem, and show that the differences in accommodation characteristics of code-choices are based on their markedness in context.

2017

Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
Shruti Rijhwani | Royal Sequiera | Monojit Choudhury | Kalika Bali | Chandra Shekhar Maddila
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios as text input rarely has a priori information on the languages used. We present a novel unsupervised word-level language detection technique for code-switched text for an arbitrarily large number of languages, which does not require any manually annotated training data. Our experiments with tweets in seven languages show a 74% relative error reduction in word-level labeling with respect to competitive baselines. We then use this system to conduct a large-scale quantitative analysis of code-switching patterns on Twitter, both global as well as region-specific, with 58M tweets.

Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks
Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Ashutosh Baheti
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

Functions of Code-Switching in Tweets: An Annotation Framework and Some Initial Experiments
Rafiya Begum | Kalika Bali | Monojit Choudhury | Koustav Rudra | Niloy Ganguly
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Code-Switching (CS) between two languages is extremely common in communities with societal multilingualism where speakers switch between two or more languages when interacting with each other. CS has been extensively studied in spoken language by linguists for several decades but with the popularity of social-media and less formal Computer Mediated Communication, we now see a big rise in the use of CS in the text form. This poses interesting challenges and a need for computational processing of such code-switched data. As with any Computational Linguistic analysis and Natural Language Processing tools and applications, we need annotated data for understanding, processing, and generation of code-switched language. In this study, we focus on CS between English and Hindi Tweets extracted from the Twitter stream of Hindi-English bilinguals. We present an annotation scheme for annotating the pragmatic functions of CS in Hindi-English (Hi-En) code-switched tweets based on a linguistic analysis and some initial experiments.

Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?
Koustav Rudra | Shruti Rijhwani | Rafiya Begum | Kalika Bali | Monojit Choudhury | Niloy Ganguly
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
Royal Sequiera | Monojit Choudhury | Kalika Bali
Proceedings of the 12th International Conference on Natural Language Processing

2014

Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System
Gokul Chittaranjan | Yogarshi Vyas | Kalika Bali | Monojit Choudhury
Proceedings of the First Workshop on Computational Approaches to Code Switching

“I am borrowing ya mixing?” An Analysis of English-Hindi Code Mixing in Facebook
Kalika Bali | Jatin Sharma | Monojit Choudhury | Yogarshi Vyas
Proceedings of the First Workshop on Computational Approaches to Code Switching

“ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification
Spandana Gella | Kalika Bali | Monojit Choudhury
Proceedings of the 11th International Conference on Natural Language Processing

POS Tagging of English-Hindi Code-Mixed Social Media Content
Yogarshi Vyas | Spandana Gella | Jatin Sharma | Kalika Bali | Monojit Choudhury
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Entailment: An Effective Metric for Comparing and Evaluating Hierarchical and Non-hierarchical Annotation Schemes
Rohan Ramanath | Monojit Choudhury | Kalika Bali
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation
Rohan Ramanath | Monojit Choudhury | Kalika Bali | Rishiraj Saha Roy
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics
Kanika Gupta | Monojit Choudhury | Kalika Bali
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes a method to mine Hindi-English transliteration pairs from online Hindi song lyrics. The technique is based on the observation that lyrics are transliterated word-by-word, maintaining the precise word order. The mining task is nevertheless challenging because the Hindi lyrics and their transliterations are usually available from different, often unrelated, websites. Therefore, it is a non-trivial task to match the Hindi lyrics to their transliterated counterparts. Moreover, there are various types of noise in lyrics data that need to be appropriately handled before songs can be aligned at word level. The mined data of 30823 unique Hindi-English transliteration pairs with an accuracy of more than 92% is available publicly. Although the present work reports mining of Hindi-English word pairs, the same technique can be easily adapted for other languages for which song lyrics are available online in native and Roman scripts.

Proceedings of the Second Workshop on Advances in Text Input Methods
Kalika Bali | Monojit Choudhury | Yoh Okuno
Proceedings of the Second Workshop on Advances in Text Input Methods

2011

Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context
Umair Z Ahmed | Kalika Bali | Monojit Choudhury | Sowmya VB
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

2010

Resource Creation for Training and Testing of Transliteration Systems for Indian Languages
Sowmya V. B. | Monojit Choudhury | Kalika Bali | Tirthankar Dasgupta | Anupam Basu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, like Baraha, Akshara, Quillpad, etc., use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also for the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.

2009

Complex Linguistic Annotation – No Easy Way Out! A Case from Bangla and Hindi POS Labeling Tasks
Sandipan Dandapat | Priyanka Biswas | Monojit Choudhury | Kalika Bali
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

A Common Parts-of-Speech Tagset Framework for Indian Languages
Baskaran Sankaran | Kalika Bali | Monojit Choudhury | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | S. Rajendran | K. Saravanan | L. Sobha | K.V. Subbarao
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of a significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features, ignoring linguistic richness and idiosyncrasies. The new framework that is proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES, and this enables the framework to be flexible enough to capture rich features of a language/language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.

Designing a Common POS-Tagset Framework for Indian Languages
Sankaran Baskaran | Kalika Bali | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | Rajendran S | Saravanan K | Sobha L | Subbarao K V.
Proceedings of the 6th Workshop on Asian Language Resources

2004

Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis
S.R. Deepa | Kalika Bali | A.G. Ramakrishnan | Partha Pratim Talukdar
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)