Monojit Choudhury


2020

TaxiNLI: Taking a Ride up the NLU Hill
Pratik Joshi | Somak Aditya | Aalok Sathe | Monojit Choudhury
Proceedings of the 24th Conference on Computational Natural Language Learning

Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear as to which specific concepts are learnt by the trained systems and where they can achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TaxiNLI, a new dataset, that has 10k examples from the MNLI dataset with these taxonomic labels. Through various experiments on TaxiNLI, we observe that whereas for certain taxonomic categories SOTA neural models have achieved near perfect accuracies—a large jump over the previous models—some categories still remain difficult. Our work adds to the growing body of literature that shows the gaps in the current NLI systems and datasets through a systematic presentation and analysis of reasoning categories.

GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja | Sandipan Dandapat | Anirudh Srinivasan | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.

The State and Fate of Linguistic Diversity and Inclusion in the NLP World
Pratik Joshi | Sebastin Santy | Amar Budhiraja | Kalika Bali | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the “language agnostic” status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.

Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers
Basil Abraham | Danish Goel | Divya Siddarth | Kalika Bali | Manu Chopra | Monojit Choudhury | Pratik Joshi | Preethi Jyoti | Sunayana Sitaram | Vivek Seshadri
Proceedings of the 12th Language Resources and Evaluation Conference

Voice-based technologies are essential to cater to the hundreds of millions of new smartphone users. However, most of the languages spoken by these new users have little to no labelled speech data. Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task. Moreover, existing platforms typically collect speech data only from urban speakers familiar with digital technology, whose dialects are often very different from those of low-income users. In this paper, we explore the possibility of collecting labelled speech data directly from low-income workers. In addition to providing diversity to the speech dataset, we believe this approach can also provide valuable supplemental earning opportunities to these communities. To this end, we conducted a study where we collected labelled speech data in the Marathi language from three different user groups: low-income rural users, low-income urban users, and university students. Overall, we collected 109 hours of data from 36 participants. Our results show that the data collected from low-income participants is of comparable quality to the data collected from university students (who are typically employed to do this work) and that crowdsourcing speech data from low-income rural and urban workers is a viable method of gathering speech data.

Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed in the dataset obtained. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.

Understanding Script-Mixing: A Case Study of Hindi-English Bilingual Twitter Users
Abhishek Srivastava | Kalika Bali | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

In a multi-lingual and multi-script society such as India, many users resort to code-mixing while typing on social media. While code-mixing has received a lot of attention in the past few years, it has mostly been studied within a single-script scenario. In this work, we present a case study of Hindi-English bilingual Twitter users while considering the nuances that come with the intermixing of different scripts. We present a concise analysis of how scripts and languages interact in communities and cultures where code-mixing is rampant and offer certain insights into the findings. Our analysis shows that both intra-sentential and inter-sentential script-mixing are present on Twitter and show different behavior in different contexts. Examples suggest that script can be employed as a tool for emphasizing certain phrases within a sentence or disambiguating the meaning of a word. Script choice can also be an indicator of whether a word is borrowed or not. We present our analysis along with examples that bring out the nuances of the different cases.

Code-mixed parse trees and how to find them
Anirudh Srinivasan | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

In this paper, we explore methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees. Existing work has shown that linguistic theories can be used to generate code-mixed sentences from a set of parallel sentences. We build upon this work, using one of these theories, the Equivalence-Constraint theory, to obtain the parse trees of synthetically generated code-mixed sentences and evaluate them with a neural constituency parser. We highlight the lack of a dataset of non-synthetic code-mixed constituency parse trees and how it makes our evaluation difficult. To complete our evaluation, we convert a code-mixed dependency parse tree set into “pseudo constituency trees” and find that a parser trained on synthetically generated trees is able to parse these reasonably well.

2019

Processing and Understanding Mixed Language Data
Monojit Choudhury | Anirudh Srinivasan | Sandipan Dandapat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts

Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs. Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user-generated content from some geographies, like South Asia, parts of Europe, and Singapore, is code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies. In the past 5 years, there has been an active interest in computational models for code-mixing with a substantive research outcome in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial is expected to fill this gap and provide new researchers in the area with a foundation in both linguistic and computational aspects of code-mixing. We hope that this then becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.

INMT: Interactive Neural Machine Translation Prediction
Sebastin Santy | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

In this paper, we demonstrate an Interactive Machine Translation interface that assists human translators with on-the-fly hints and suggestions. This makes the end-to-end translation process faster and more efficient, and helps produce high-quality translations. We augment the OpenNMT backend with a mechanism to accept the user input and generate conditioned translations.

2018

An Integrated Representation of Linguistic and Social Functions of Code-Switching
Silvana Hartmann | Monojit Choudhury | Kalika Bali
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Discovering Canonical Indian English Accents: A Crowdsourcing-based Approach
Sunayana Sitaram | Varun Manjunath | Varun Bharadwaj | Monojit Choudhury | Kalika Bali | Michael Tjalve
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Word Embeddings for Code-Mixed Language Processing
Adithya Pratapa | Monojit Choudhury | Sunayana Sitaram
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text. Our results show that while CVM and CCA based embeddings perform as well as the proposed embedding technique on semantic and syntactic tasks respectively, the proposed approach provides the best performance for both tasks overall. Thus, this study demonstrates that existing bilingual embedding techniques are not ideal for code-mixed text processing and there is a need for learning multilingual word embedding from the code-mixed text.

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
Adithya Pratapa | Gayatri Bhat | Monojit Choudhury | Sunayana Sitaram | Sandipan Dandapat | Kalika Bali
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training language models for Code-mixed (CM) language is known to be a difficult problem because of the lack of data, compounded by the increased confusability due to the presence of more than one language. We present a computational technique for creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory. We show that when training examples are sampled appropriately from this synthetic data and presented in a certain order (aka training curriculum) along with monolingual and real CM data, it can significantly reduce the perplexity of an RNN-based language model. We also show that randomly generated CM data does not help in decreasing the perplexity of the LMs.

Phone Merging For Code-Switched Speech Recognition
Sunit Sivasankaran | Brij Mohan Lal Srivastava | Sunayana Sitaram | Kalika Bali | Monojit Choudhury
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Speakers in multilingual communities often switch between or mix multiple languages in the same conversation. Automatic Speech Recognition (ASR) of code-switched speech faces many challenges including the influence of phones of different languages on each other. This paper shows evidence that phone sharing between languages improves the Acoustic Model performance for Hindi-English code-switched speech. We compare a baseline system built with separate phones for Hindi and English with systems where the phones were manually merged based on linguistic knowledge. Encouraged by the improved ASR performance after manually merging the phones, we further investigate multiple data-driven methods to identify phones to be merged across the languages. We show a detailed analysis of automatic phone merging in this language pair and the impact it has on individual phone accuracies and WER. Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.

Accommodation of Conversational Code-Choice
Anshul Bawa | Monojit Choudhury | Kalika Bali
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Bilingual speakers often freely mix languages. However, in such bilingual conversations, are the language choices of the speakers coordinated? How much does one speaker’s choice of language affect other speakers? In this paper, we formulate code-choice as a linguistic style, and show that speakers are indeed sensitive to and accommodating of each other’s code-choice. We find that the saliency or markedness of a language in context directly affects the degree of accommodation observed. More importantly, we discover that accommodation of code-choices persists over several conversational turns. We also propose an alternative interpretation of conversational accommodation as a retrieval problem, and show that the differences in accommodation characteristics of code-choices are based on their markedness in context.

2017

Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
Shruti Rijhwani | Royal Sequiera | Monojit Choudhury | Kalika Bali | Chandra Shekhar Maddila
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios as text input rarely has a priori information on the languages used. We present a novel unsupervised word-level language detection technique for code-switched text for an arbitrarily large number of languages, which does not require any manually annotated training data. Our experiments with tweets in seven languages show a 74% relative error reduction in word-level labeling with respect to competitive baselines. We then use this system to conduct a large-scale quantitative analysis of code-switching patterns on Twitter, both global as well as region-specific, with 58M tweets.

Curriculum Design for Code-switching: Experiments with Language Identification and Language Modeling with Deep Neural Networks
Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Ashutosh Baheti
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Quantitative Characterization of Code Switching Patterns in Complex Multi-Party Conversations: A Case Study on Hindi Movie Scripts
Adithya Pratapa | Monojit Choudhury
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media
Jasabanta Patro | Bidisha Samanta | Saurabh Singh | Abhipsa Basu | Prithwish Mukherjee | Monojit Choudhury | Animesh Mukherjee
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on the signals from social media. In terms of Spearman’s correlation values, our methods perform more than two times better (∼ 0.62) in predicting the borrowing likeliness compared to the best performing baseline (∼ 0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88% of cases the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems.

2016

Functions of Code-Switching in Tweets: An Annotation Framework and Some Initial Experiments
Rafiya Begum | Kalika Bali | Monojit Choudhury | Koustav Rudra | Niloy Ganguly
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Code-Switching (CS) between two languages is extremely common in communities with societal multilingualism where speakers switch between two or more languages when interacting with each other. CS has been extensively studied in spoken language by linguists for several decades but with the popularity of social-media and less formal Computer Mediated Communication, we now see a big rise in the use of CS in the text form. This poses interesting challenges and a need for computational processing of such code-switched data. As with any Computational Linguistic analysis and Natural Language Processing tools and applications, we need annotated data for understanding, processing, and generation of code-switched language. In this study, we focus on CS between English and Hindi Tweets extracted from the Twitter stream of Hindi-English bilinguals. We present an annotation scheme for annotating the pragmatic functions of CS in Hindi-English (Hi-En) code-switched tweets based on a linguistic analysis and some initial experiments.

Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?
Koustav Rudra | Shruti Rijhwani | Rafiya Begum | Kalika Bali | Monojit Choudhury | Niloy Ganguly
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
Royal Sequiera | Monojit Choudhury | Kalika Bali
Proceedings of the 12th International Conference on Natural Language Processing

2014

Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System
Gokul Chittaranjan | Yogarshi Vyas | Kalika Bali | Monojit Choudhury
Proceedings of the First Workshop on Computational Approaches to Code Switching

“I am borrowing ya mixing ?” An Analysis of English-Hindi Code Mixing in Facebook
Kalika Bali | Jatin Sharma | Monojit Choudhury | Yogarshi Vyas
Proceedings of the First Workshop on Computational Approaches to Code Switching

Hierarchical Recursive Tagset for Annotating Cooking Recipes
Sharath Reddy Gunamgari | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 11th International Conference on Natural Language Processing

“ye word kis lang ka hai bhai?” Testing the Limits of Word level Language Identification
Spandana Gella | Kalika Bali | Monojit Choudhury
Proceedings of the 11th International Conference on Natural Language Processing

Automatic Discovery of Adposition Typology
Rishiraj Saha Roy | Rahul Katare | Niloy Ganguly | Monojit Choudhury
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

POS Tagging of English-Hindi Code-Mixed Social Media Content
Yogarshi Vyas | Spandana Gella | Jatin Sharma | Kalika Bali | Monojit Choudhury
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Entailment: An Effective Metric for Comparing and Evaluating Hierarchical and Non-hierarchical Annotation Schemes
Rohan Ramanath | Monojit Choudhury | Kalika Bali
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation
Rohan Ramanath | Monojit Choudhury | Kalika Bali | Rishiraj Saha Roy
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora
K Saravanan | Monojit Choudhury | Raghavendra Udupa | A Kumaran
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Named Entities (NEs) that occur in natural language text are important especially due to the advent of social media, and they play a critical role in the development of many natural language technologies. In this paper, we systematically analyze the patterns of occurrence and co-occurrence of NEs in standard large English news corpora, providing valuable insight for the understanding of the corpus, and subsequently paving the way for the development of technologies that rely critically on handling NEs. We use two distinct approaches: a normal statistical analysis that measures and reports the occurrence patterns of NEs in terms of frequency, growth, etc., and a complex-networks-based analysis that measures the co-occurrence pattern in terms of connectivity, degree distribution, small-world phenomenon, etc. Our analysis indicates that: (i) NEs form an open set in corpora and grow linearly, (ii) there is a kernel of NEs and a large periphery of NEs that occur rarely, and (iii) there is strong evidence of a small-world phenomenon. Our findings may suggest effective ways for construction of NE lexicons to aid efficient development of several natural language technologies.

Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics
Kanika Gupta | Monojit Choudhury | Kalika Bali
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes a method to mine Hindi-English transliteration pairs from online Hindi song lyrics. The technique is based on the observation that lyrics are transliterated word-by-word, maintaining the precise word order. The mining task is nevertheless challenging because the Hindi lyrics and their transliterations are usually available from different, often unrelated, websites. Therefore, it is a non-trivial task to match the Hindi lyrics to their transliterated counterparts. Moreover, there are various types of noise in lyrics data that need to be appropriately handled before songs can be aligned at word level. The mined data of 30823 unique Hindi-English transliteration pairs with an accuracy of more than 92% is available publicly. Although the present work reports mining of Hindi-English word pairs, the same technique can be easily adapted for other languages for which song lyrics are available online in native and Roman scripts.

Proceedings of the Second Workshop on Advances in Text Input Methods
Kalika Bali | Monojit Choudhury | Yoh Okuno
Proceedings of the Second Workshop on Advances in Text Input Methods

2011

Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context
Umair Z Ahmed | Kalika Bali | Monojit Choudhury | Sowmya VB
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

2010

Global topology of word co-occurrence networks: Beyond the two-regime power-law
Monojit Choudhury | Diptesh Chatterjee | Animesh Mukherjee
Coling 2010: Posters

Resource Creation for Training and Testing of Transliteration Systems for Indian Languages
Sowmya V. B. | Monojit Choudhury | Kalika Bali | Tirthankar Dasgupta | Anupam Basu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, like Baraha, Akshara, Quillpad, etc., use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also for the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.

2009

Language Diversity across the Consonant Inventories: A Study in the Framework of Complex Networks
Monojit Choudhury | Animesh Mukherjee | Anupam Basu | Niloy Ganguly | Ashish Garg | Vaibhav Jalan
Proceedings of the EACL 2009 Workshop on Cognitive Aspects of Computational Language Acquisition

Complex Linguistic Annotation – No Easy Way Out! A Case from Bangla and Hindi POS Labeling Tasks
Sandipan Dandapat | Priyanka Biswas | Monojit Choudhury | Kalika Bali
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)
Monojit Choudhury | Samer Hassan | Animesh Mukherjee | Smaranda Muresan
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

Syntax is from Mars while Semantics from Venus! Insights from Spectral Analysis of Distributional Similarity Networks
Chris Biemann | Monojit Choudhury | Animesh Mukherjee
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Large-Coverage Root Lexicon Extraction for Hindi
Cohan Sujay Carlos | Monojit Choudhury | Sandipan Dandapat
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Discovering Global Patterns in Linguistic Networks through Spectral Analysis: A Case Study of the Consonant Inventories
Animesh Mukherjee | Monojit Choudhury | Ravi Kannan
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Unsupervised Parts-of-Speech Induction for Bengali
Joydeep Nath | Monojit Choudhury | Animesh Mukherjee | Christian Biemann | Niloy Ganguly
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a study of the word interaction networks of Bengali in the framework of complex networks. The topological properties of these networks reveal interesting insights into the morpho-syntax of the language, whereas clustering helps in the induction of the natural word classes leading to a principled way of designing POS tagsets. We compare different network construction techniques and clustering algorithms based on the cohesiveness of the word clusters. Cohesiveness is measured against two gold-standard tagsets by means of the novel metric of tag-entropy. The approach presented here is a generic one that can be easily extended to any language.

A Common Parts-of-Speech Tagset Framework for Indian Languages
Baskaran Sankaran | Kalika Bali | Monojit Choudhury | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | S. Rajendran | K. Saravanan | L. Sobha | K.V. Subbarao
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a universal Parts-of-Speech (POS) tagset framework covering most of the Indian languages (ILs) following the hierarchical and decomposable tagset schema. In spite of a significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features, ignoring linguistic richness and idiosyncrasies. The new framework that is proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES and this enables the framework to be flexible enough to capture rich features of a language/language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.

Social Network Inspired Models of NLP and Language Evolution
Monojit Choudhury | Animesh Mukherjee | Niloy Ganguly
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Invited Talk: Breaking the Zipfian Barrier of NLP
Monojit Choudhury
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing
Irina Matveeva | Chris Biemann | Monojit Choudhury | Mona Diab
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

Modeling the Structure and Dynamics of the Consonant Inventories: A Complex Network Approach
Animesh Mukherjee | Monojit Choudhury | Anupam Basu | Niloy Ganguly
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

How Difficult is it to Develop a Perfect Spell-checker? A Cross-Linguistic Analysis through Complex Network Approach
Monojit Choudhury | Markose Thomas | Animesh Mukherjee | Anupam Basu | Niloy Ganguly
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

Evolution, Optimization, and Language Change: The Case of Bengali Verb Inflections
Monojit Choudhury | Vaibhav Jalan | Sudeshna Sarkar | Anupam Basu
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology

Emergence of Community Structures in Vowel Inventories: An Analysis Based on Complex Networks
Animesh Mukherjee | Monojit Choudhury | Anupam Basu | Niloy Ganguly
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology

Redundancy Ratio: An Invariant Property of the Consonant Inventories of the World’s Languages
Animesh Mukherjee | Monojit Choudhury | Anupam Basu | Niloy Ganguly
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Analysis and Synthesis of the Distribution of Consonants over Languages: A Complex Network Approach
Monojit Choudhury | Animesh Mukherjee | Anupam Basu | Niloy Ganguly
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2004

A Diachronic Approach for Schwa Deletion in Indo Aryan Languages
Monojit Choudhury | Anupam Basu | Sudeshna Sarkar
Proceedings of the 7th Meeting of the ACL Special Interest Group in Computational Phonology: Current Themes in Computational Phonology and Morphology