Lipika Dey


2020

Extracting Semantic Aspects for Structured Representation of Clinical Trial Eligibility Criteria
Tirthankar Dasgupta | Ishani Mondal | Abir Naskar | Lipika Dey
Proceedings of the 3rd Clinical Natural Language Processing Workshop

Eligibility criteria in clinical trials specify the characteristics that a patient must or must not possess in order to be treated according to a standard clinical care guideline. As manual eligibility determination is time-consuming, automatic structuring of the eligibility criteria into various semantic categories or aspects is the need of the hour. Existing methods use hand-crafted rules and feature-based statistical machine learning methods to dynamically induce semantic aspects. However, to deal with the paucity of aspect-annotated clinical trials data, we propose a novel weakly-supervised co-training based method which can exploit a large pool of unlabeled criteria sentences to augment the limited supervised training data, and consequently enhance performance. Experiments with 0.2M criteria sentences show that the proposed approach outperforms competitive supervised baselines by 12% in terms of micro-averaged F1 score over all the aspects. A deeper analysis shows that domain-specific information boosts performance by a significant margin.

Learning Domain Terms - Empirical Methods to Enhance Enterprise Text Analytics Performance
Gargi Roy | Lipika Dey | Mohammad Shakir | Tirthankar Dasgupta
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

The performance of standard text analytics algorithms is known to be substantially degraded on consumer-generated data, which is often very noisy. These algorithms also do not work well on enterprise data, which has a very different nature from News repositories, storybooks or Wikipedia data. Text cleaning is a mandatory step which aims at noise removal and correction to improve performance. However, enterprise data needs special cleaning methods since it contains many domain terms which appear to be noise against a standard dictionary, but in reality are not. In this work we present a detailed analysis of the characteristics of enterprise data and suggest unsupervised methods for cleaning these repositories after domain terms have been automatically segregated from true noise terms. Noise terms are thereafter corrected in a contextual fashion. The effectiveness of the method is established through careful manual evaluation of error corrections over several standard data sets, including those available for hate speech detection, where there is deliberate distortion to avoid detection. We also share results that show an improvement in classification accuracy after noise correction.

Identifying pandemic-related stress factors from social-media posts – Effects on students and young-adults
Sachin Thukral | Suyash Sangwan | Arnab Chatterjee | Lipika Dey
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The COVID-19 pandemic has thrown natural life out of gear across the globe. Strict measures are deployed to curb the spread of the virus that is causing it, and the most effective of them has been social isolation. This has led to widespread gloom and depression across society, but more so among the young and the elderly. There are currently more than 200 million college students in 186 countries worldwide affected by the pandemic. The mode of education has changed suddenly, with the rapid adoption of e-learning, whereby teaching is undertaken remotely and on digital platforms. This study presents insights gathered from social media posts made by students and young adults during the COVID times. Using statistical and NLP techniques, we analyzed the behavioural issues reported by users themselves in their posts in depression-related communities on Reddit. We present methodologies to systematically analyze content using linguistic techniques to find the stress-inducing factors. Online education, losing jobs, isolation from friends and abusive families emerge as key stress factors.

2018

TCS Research at SemEval-2018 Task 1: Learning Robust Representations using Multi-Attention Architecture
Hardik Meisheri | Lipika Dey
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper presents the system description of our submission to SemEval-2018 Task 1: Affect in Tweets, for the English language. We combine three different features generated using deep learning models and traditional methods in support vector machines to create a unified ensemble system. A robust representation of a tweet is learned using a multi-attention based architecture which uses a mixture of different pre-trained embeddings. In addition, an analysis of the different features is presented. Our system ranked 2nd, 5th, and 7th in different subtasks among 75 teams.

Automatic Curation and Visualization of Crime Related Information from Incrementally Crawled Multi-source News Reports
Tirthankar Dasgupta | Lipika Dey | Rupsa Saha | Abir Naskar
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

In this paper, we demonstrate a system for the automatic extraction and curation of crime-related information from multi-source digitally published News articles collected over a period of five years. We have used a deep convolution recurrent neural network model to analyze crime articles and extract different crime-related entities and events. The proposed methods are not restricted to detecting known crimes only but contribute actively towards maintaining an updated crime ontology. We have done experiments with a collection of 5000 crime-reporting News articles spanning time and multiple sources. The end-product of our experiments is a crime register that contains details of crimes committed across geographies and time. This register can be further utilized for analytical and reporting purposes.

Augmenting Textual Qualitative Features in Deep Convolution Recurrent Neural Network for Automatic Essay Scoring
Tirthankar Dasgupta | Abir Naskar | Lipika Dey | Rupsa Saha
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

In this paper we present a qualitatively enhanced deep convolution recurrent neural network for computing the quality of a text in an automatic essay scoring task. The novelty of the work lies in the fact that instead of considering only the word and sentence representation of a text, we augment a hierarchical convolution recurrent neural network framework with the different complex linguistic, cognitive and psychological features associated with a text document. Our preliminary investigation shows that incorporating such qualitative feature vectors along with standard word/sentence embeddings can give us a better understanding of how to improve the overall evaluation of the input essays.

Automatic Extraction of Causal Relations from Text using Linguistically Informed Deep Neural Networks
Tirthankar Dasgupta | Rupsa Saha | Lipika Dey | Abir Naskar
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

In this paper we propose a linguistically informed recursive neural network architecture for automatic extraction of cause-effect relations from text. These relations can be expressed in arbitrarily complex ways. The architecture uses word-level embeddings and other linguistic features to detect causal events and their effects mentioned within a sentence. The extracted events and their relations are used to build a causal graph after clustering and appropriate generalization, which is then used for predictive purposes. We have evaluated the performance of the proposed extraction model with respect to two baseline systems, one a rule-based classifier and the other a conditional random field (CRF) based supervised model. We have also compared our results with related work reported in the past by other authors on the SemEval data set, and found that the proposed bi-directional LSTM model enhanced with an additional linguistic layer performs better. We have also worked extensively on creating new annotated datasets from publicly available data, which we are willing to share with the community.

Leveraging Web Based Evidence Gathering for Drug Information Identification from Tweets
Rupsa Saha | Abir Naskar | Tirthankar Dasgupta | Lipika Dey
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

In this paper, we have explored web-based evidence gathering and different linguistic features to automatically extract drug names from tweets and further classify such tweets as reporting Adverse Drug Events or not. We have evaluated our proposed models on the datasets released for the SMM4H workshop shared Task-1 and Task-3, respectively. Our evaluation results show that the proposed model achieved good results, with Precision, Recall and F-scores of 78.5%, 88% and 82.9% respectively for Task-1, and 33.2%, 54.7% and 41.3% for Task-3.

2017

Textmining at EmoInt-2017: A Deep Learning Approach to Sentiment Intensity Scoring of English Tweets
Hardik Meisheri | Rupsa Saha | Priyanka Sinha | Lipika Dey
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This paper describes our approach to the Emotion Intensity shared task. A parallel architecture of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks is used, along with two sets of extracted features which aid the network in judging emotion intensity. Experiments on different models and various feature sets are described, and an analysis of the results is presented.

2016

A Framework for Mining Enterprise Risk and Risk Factors from News Documents
Tirthankar Dasgupta | Lipika Dey | Prasenjit Dey | Rupsa Saha
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Any real-world event or trend that can affect a company’s growth trajectory can be considered a risk. There has been a growing need to automatically identify, extract and analyze risk-related statements from news events. In this demonstration, we present a risk analytics framework that processes enterprise project management reports in the form of textual data and news documents and classifies them into valid and invalid risk categories. The framework also extracts information from the text pertaining to the different categories of risks, such as their possible causes and impacts. Accordingly, we have used machine learning based techniques and studied different linguistic features such as n-grams, POS, dependency, future timing and uncertainty factors in texts, and their various combinations. A manual annotation study by management experts, using risk descriptions collected for a specific organization, was conducted to evaluate the framework. The evaluation showed promising results for automated risk analysis and identification.

2015

Mining HEXACO personality traits from Enterprise Social Media
Priyanka Sinha | Lipika Dey | Pabitra Mitra | Anupam Basu
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis