## AnupamBasu

#### 2014

In this paper we have developed an open-source online computational framework that can be used by different research groups to conduct reading researches on Indian language texts. The framework can be used to develop a large annotated Indian language text comprehension data from different user based experiments. The novelty in this framework lies in the fact that it brings different empirical data-collection techniques for text comprehension under one roof. The framework has been customized specifically to address language particularities for Indian languages. It will also offer many types of automatic analysis on the data at different levels such as full text, sentence and word level. To address the subjectivity of text difficulty perception, the framework allows to capture user background against multiple factors. The assimilated data can be automatically cross referenced against varying strata of readers.

#### 2010

The paper presents a new fuzzy agreement measure $\gamma_f$ for determining the agreement in multi-label and subjective annotation task. In this annotation framework, one data item may belong to a category or a class with a belief value denoting the degree of confidence of an annotator in assigning the data item to that category. We have provided a notion of disagreement based on the belief values provided by the annotators with respect to a category. The fuzzy agreement measure $\gamma_f$ has been proposed by defining different fuzzy agreement sets based on the distribution of difference of belief values provided by the annotators. The fuzzy agreement has been computed by studying the average agreement over all the data items and annotators. Finally, we elaborate on the computation $\gamma_f$ measure with a case study on emotion text data where a data item (sentence) may belong to more than one emotion category with varying belief values.
Machine transliteration is used in a number of NLP applications ranging from machine translation and information retrieval to input mechanisms for non-roman scripts. Many popular Input Method Editors for Indian languages, like Baraha, Akshara, Quillpad etc, use back-transliteration as a mechanism to allow users to input text in a number of Indian language. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparisons of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of ~2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also help in the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.