Diptesh Kanojia


2020

“A Passage to India”: Pre-trained Word Embeddings for Indian Languages
Saurav Kumar | Saunack Kumar | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Dense word vectors, or ‘word embeddings’, which encode semantic properties of words, have now become integral to NLP tasks like Machine Translation (MT), Question Answering (QA), Word Sense Disambiguation (WSD), and Information Retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place the embeddings for all these languages, viz., Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu, in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all pairs of the aforementioned languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS and NER tasks for all these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for resource-constrained Indian language NLP. The title of this paper refers to the famous novel “A Passage to India” by E.M. Forster, first published in 1924.

Challenge Dataset of Cognates and False Friend Pairs from Indian Languages
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 12th Language Resources and Evaluation Conference

Cognates are present in multiple variants of the same text across different languages (e.g., “hund” in German and “hound” in English mean “dog”). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a False Friends’ dataset for eleven language pairs. We evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.
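
The abstract does not name the baseline cognate detection approaches it uses. As a rough illustration of what an orthographic baseline can look like, the sketch below scores a word pair by normalized edit distance. The function names and the choice of Levenshtein distance are assumptions made for this example, not the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Hypothetical baseline score in [0, 1]; 1.0 means identical spellings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# The German/English pair from the abstract differs by one inserted letter.
score = orthographic_similarity("hund", "hound")
```

A score like this looks only at spelling, so on its own it cannot separate true cognates from false friends, which are spelled alike but differ in meaning.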

Recommendation Chart of Domains for Cross-Domain Sentiment Analysis: Findings of A 20 Domain Study
Akash Sheoran | Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya
Proceedings of the 12th Language Resources and Evaluation Conference

Cross-domain sentiment analysis (CDSA) helps to address the problem of data scarcity in scenarios where labelled data for a domain (known as the target domain) is unavailable or insufficient. However, the choice of a domain to leverage (known as the source domain) is, at best, intuitive. In this paper, we investigate text similarity metrics to facilitate source domain selection for CDSA. We report results on 20 domains (all possible pairs) using 11 similarity metrics. Specifically, we compare CDSA performance with these metrics for different domain-pairs to enable the selection of a suitable source domain, given a target domain. These metrics include two novel metrics that evaluate domain adaptability to aid source domain selection for labelled data, and we use word- and sentence-based embeddings as metrics for unlabelled data. The goal of our experiments is a recommendation chart that gives the K best source domains for CDSA for a given target domain. We show that the best K source domains returned by our similarity metrics have a precision of over 50%, for varying values of K.
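
The precision figure reported above can be read as precision at K: of the K source domains a similarity metric ranks highest, the fraction that are also among the truly best source domains by CDSA performance. The sketch below is a minimal illustration under that reading. The domain names, similarity scores and the `relevant` set are made up for the example and do not come from the paper.

```python
from typing import Dict, List, Set


def top_k_sources(similarity: Dict[str, float], k: int) -> List[str]:
    """Rank candidate source domains by similarity to the target, highest first."""
    ranked = sorted(similarity, key=similarity.get, reverse=True)
    return ranked[:k]


def precision_at_k(recommended: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k recommended domains that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for domain in recommended[:k] if domain in relevant)
    return hits / k


# Hypothetical similarity of four candidate source domains to one target domain.
similarity = {"books": 0.91, "music": 0.84, "dvd": 0.80, "toys": 0.42}
# Hypothetical set of source domains that actually give the best CDSA accuracy.
relevant = {"books", "dvd", "kitchen"}

recommended = top_k_sources(similarity, k=3)        # books, music, dvd
p_at_3 = precision_at_k(recommended, relevant, 3)   # 2 of the 3 are relevant
```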

Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages
Diptesh Kanojia | Raj Dabre | Shubham Dewangan | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni
Proceedings of the 28th International Conference on Computational Linguistics

Cognates are variants of the same lexical form across different languages; for example, “fonema” in Spanish and “phoneme” in English are cognates, both of which mean “a unit of sound”. The task of automatic detection of cognates between any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT) as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points, in terms of F-score, for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.

2019

Utilizing Word Embeddings based Features for Phylogenetic Tree Generation of Sanskrit Texts
Diptesh Kanojia | Abhijeet Dubey | Malhar Kulkarni | Pushpak Bhattacharyya | Gholamreza Haffari
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

An Introduction to the Textual History Tool
Diptesh Kanojia | Malhar Kulkarni | Pushpak Bhattacharyya | Eivind Kahrs
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

2018

Indian Language Wordnets and their Linkages with Princeton WordNet
Diptesh Kanojia | Kevin Patel | Pushpak Bhattacharyya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Eyes are the Windows to the Soul: Predicting the Rating of Text Quality Using Gaze Behaviour
Sandeep Mathias | Diptesh Kanojia | Kevin Patel | Samarth Agrawal | Abhijit Mishra | Pushpak Bhattacharyya
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Predicting a reader’s rating of text quality is a challenging task that involves estimating different subjective aspects of the text, like structure, clarity, etc. Such subjective aspects are better handled using cognitive information. One such source of cognitive information is gaze behaviour. In this paper, we show that gaze behaviour does indeed help in effectively predicting the rating of text quality. To do this, we first model text quality as a function of three properties: organization, coherence and cohesion. Then, we demonstrate how capturing gaze behaviour helps in predicting each of these properties, and hence the overall quality, by reporting improvements obtained by adding gaze features to traditional textual features for score prediction. We also hypothesize that if a reader has fully understood the text, the corresponding gaze behaviour would give a better indication of the assigned rating, as opposed to partial understanding. Our experiments validate this hypothesis by showing greater agreement between the given rating and the predicted rating when the reader has a full understanding of the text.

2017

Is your Statement Purposeless? Predicting Computer Science Graduation Admission Acceptance based on Statement Of Purpose
Diptesh Kanojia | Nikhil Wani | Pushpak Bhattacharyya
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

That’ll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models
Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya | Mark James Carman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In the absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrimental for MT. We then present a novel ‘sentential’ approach to use this coarse lexical resource from a multilingual topic model. Our coarse lexical resource, when injected with a parallel corpus, outperforms a system trained using a parallel corpus and a good quality lexical resource. As demonstrated by the quality of our coarse lexical resource and its benefit to MT, we believe that our sentential approach to creating such a resource will help MT for resource-constrained languages.

SlangNet: A WordNet like resource for English Slang
Shehzaad Dhuliawala | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a WordNet-like structured resource for slang words and neologisms on the internet. The dynamism of language is often an indication that current language technology tools, trained on today’s data, may not be able to process the language in the future. Our resource could be (1) used to augment the WordNet, and (2) used in several Natural Language Processing (NLP) applications which make use of noisy data on the internet, like Information Retrieval and Web Mining. Such a resource can also be used to distinguish slang word senses from conventional word senses. To stimulate similar innovations widely in the NLP community, we test the efficacy of our resource for detecting slang using standard bag-of-words Word Sense Disambiguation (WSD) algorithms (Lesk and Extended Lesk) for English data on the internet.
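
As a rough illustration of how a bag-of-words Lesk algorithm can separate a slang sense from a conventional one, the sketch below implements a simplified Lesk variant: it picks the sense whose gloss shares the most words with the context. The two glosses for “sick” and the sense labels are invented for this example and are not taken from SlangNet or WordNet. The stopword list is likewise a small assumption.

```python
import re
from typing import Dict, Set

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "to", "is", "in", "and", "or",
             "that", "for", "was", "as", "by", "with"}


def content_words(text: str) -> Set[str]:
    """Lowercase the text and keep its non-stopword tokens as a set."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS}


def simplified_lesk(context: str, glosses: Dict[str, str]) -> str:
    """Return the sense label whose gloss overlaps most with the context words.

    Ties are resolved in favour of the sense listed first.
    """
    context_bag = content_words(context)
    best_sense, best_overlap = "", -1
    for sense, gloss in glosses.items():
        overlap = len(context_bag & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical sense inventory for the word "sick".
glosses = {
    "sick.conventional": "affected by illness or disease; unwell",
    "sick.slang": "excellent, impressive, or cool; praise for a trick or performance",
}

slang_use = simplified_lesk("That skateboard trick was sick, really impressive", glosses)
plain_use = simplified_lesk("He stayed home with an illness and felt unwell", glosses)
```

Extended Lesk additionally compares the glosses of related senses, which this sketch leaves out.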

Leveraging Cognitive Features for Sentiment Analysis
Abhijit Mishra | Diptesh Kanojia | Seema Nagar | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

Harnessing Cognitive Features for Sarcasm Detection
Abhijit Mishra | Diptesh Kanojia | Seema Nagar | Kuntal Dey | Pushpak Bhattacharyya
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

TransChat: Cross-Lingual Instant Messaging for Indian Languages
Diptesh Kanojia | Shehzaad Dhuliawala | Abhijit Mishra | Naman Gupta | Pushpak Bhattacharyya
Proceedings of the 12th International Conference on Natural Language Processing

Using Multilingual Topic Models for Improved Alignment in English-Hindi MT
Diptesh Kanojia | Aditya Joshi | Pushpak Bhattacharyya | Mark James Carman
Proceedings of the 12th International Conference on Natural Language Processing

2014

Do not do processing, when you can look up: Towards a Discrimination Net for WSD
Diptesh Kanojia | Pushpak Bhattacharyya | Raj Dabre | Siddhartha Gunti | Manish Shrivastava
Proceedings of the Seventh Global Wordnet Conference

PaCMan : Parallel Corpus Management Workbench
Diptesh Kanojia | Manish Shrivastava | Raj Dabre | Pushpak Bhattacharyya
Proceedings of the 11th International Conference on Natural Language Processing

2013

More than meets the eye: Study of Human Cognition in Sense Annotation
Salil Joshi | Diptesh Kanojia | Pushpak Bhattacharyya
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Discrimination-Net for Hindi
Diptesh Kanojia | Arindam Chatterjee | Salil Joshi | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers