Cornelia Caragea


Detecting Perceived Emotions in Hurricane Disasters
Shrey Desai | Cornelia Caragea | Junyi Jessy Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural disasters (e.g., hurricanes) affect millions of people each year, causing widespread destruction in their wake. People have recently taken to social media websites (e.g., Twitter) to share their sentiments and feelings with the larger community. Consequently, these platforms have become instrumental in understanding and perceiving emotions at scale. In this paper, we introduce HurricaneEmo, an emotion dataset of 15,000 English tweets spanning three hurricanes: Harvey, Irma, and Maria. We present a comprehensive study of fine-grained emotions and propose classification tasks to discriminate between coarse-grained emotion groups. Our best BERT model, even after task-guided pre-training which leverages unlabeled Twitter data, achieves only 68% accuracy (averaged across all groups). HurricaneEmo serves not only as a challenging benchmark for models but also as a valuable resource for analyzing emotions in disaster-centric domains.

Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup
Jishnu Ray Chowdhury | Cornelia Caragea | Doina Caragea
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Distinguishing informative and actionable messages from a social media platform like Twitter is critical for facilitating disaster management. For this purpose, we compile a multilingual dataset of over 130K samples for multi-label classification of disaster-related tweets. We present a masking-based loss function for partially labelled samples and demonstrate the effectiveness of Manifold Mixup in the text domain. Our main model is based on Multilingual BERT, which we further improve with Manifold Mixup. We show that our model generalizes to unseen disasters in the test set. Furthermore, we analyze the capability of our model for zero-shot generalization to new languages. Our code, dataset, and other resources are available on GitHub.
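
The abstract above mentions a masking-based loss for partially labelled samples without giving its form. One common way to realise the idea, shown here only as a minimal sketch and not as the authors' formulation, is to compute binary cross-entropy per label and average it over the labels that are actually annotated. The function name `masked_bce`, the 0/1 mask convention, and the clipping constant are all assumptions made for illustration.

```python
import math


def masked_bce(probs, labels, mask):
    """Mean binary cross-entropy over the labels whose mask entry is 1.

    probs  -- predicted probabilities, one per label
    labels -- gold 0/1 values (ignored where mask is 0)
    mask   -- 1 if the label is annotated for this sample, else 0
    """
    eps = 1e-12  # assumed clipping constant to keep log() finite
    total, observed = 0.0, 0
    for p, y, m in zip(probs, labels, mask):
        if not m:
            continue  # unannotated label: contributes nothing to the loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        observed += 1
    # With no annotated labels there is nothing to learn from this sample.
    return total / observed if observed else 0.0


# Third label is unannotated, so its prediction does not affect the loss.
loss = masked_bce([0.9, 0.2, 0.5], [1, 0, 1], [1, 1, 0])
```

Under this sketch, a sample annotated for only some label types still provides a training signal for those labels, while predictions for the missing labels are left unpenalised.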

Dynamic Classification in Web Archiving Collections
Krutarth Patel | Cornelia Caragea | Mark Phillips
Proceedings of the 12th Language Resources and Evaluation Conference

Web-archived data usually contains high-quality documents that are very useful for creating specialized collections of documents. To create such collections, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection within the large collections (millions of documents in size) of Web archiving institutions. However, the patterns of the documents of interest can differ substantially from one document to another, which makes the automatic classification task very challenging. In this paper, we explore dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types. Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on three datasets.

Scientific Keyphrase Identification and Classification by Pre-Trained Language Models Intermediate Task Transfer Learning
Seoyeon Park | Cornelia Caragea
Proceedings of the 28th International Conference on Computational Linguistics

Scientific keyphrase identification and classification is the task of detecting keyphrases in scholarly text and classifying them into types from a set of predefined classes. This task has a wide range of benefits, but it remains challenging because of the lack of large amounts of labeled data required for training deep neural models. In order to overcome this challenge, we explore pre-trained language models BERT and SciBERT with intermediate task transfer learning, using 42 data-rich related intermediate-target task combinations. We reveal that intermediate task transfer learning on SciBERT induces a better starting point for target task fine-tuning compared with BERT and achieves competitive performance in scientific keyphrase identification and classification compared to both previous works and strong baselines. Interestingly, we observe that BERT with intermediate task transfer learning fails to improve the performance of scientific keyphrase identification and classification, potentially due to significant catastrophic forgetting. This result highlights that scientific knowledge acquired during the pre-training of language models on large scientific collections plays an important role in the target tasks. We also observe that intermediate tasks related to sequence tagging, especially syntactic structure learning tasks such as POS Tagging, tend to work best for scientific keyphrase identification and classification.

On the Use of Web Search to Improve Scientific Collections
Krutarth Patel | Cornelia Caragea | Sujatha Das Gollapalli
Proceedings of the First Workshop on Scholarly Document Processing

Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly-available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ~267,000 unique research papers through our fully-automated framework using ~76,000 queries, resulting in almost 200,000 more papers than the number of queries. Moreover, through a combination of title and author name search, we were able to recover 78% of the original searched titles.

CancerEmo: A Dataset for Fine-Grained Emotion Detection
Tiberiu Sosea | Cornelia Caragea
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Emotions are an important element of human nature, often affecting the overall wellbeing of a person. Therefore, it is no surprise that the health domain is a valuable area of interest for emotion detection, as it can provide medical staff or caregivers with essential information about patients. However, progress on this task has been hampered by the absence of large labeled datasets. To this end, we introduce CancerEmo, an emotion dataset created from an online health community and annotated with eight fine-grained emotions. We perform a comprehensive analysis of these emotions and develop deep learning models on the newly created dataset. Our best BERT model achieves an average F1 of 71%, which we improve further using domain-specific pre-training.


The Myth of Double-Blind Review Revisited: ACL vs. EMNLP
Cornelia Caragea | Ana Uban | Liviu P. Dinu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The review and selection process for scientific paper publication is essential for the quality of scholarly publications in a scientific field. The double-blind review system, which enforces author anonymity during the review period, is widely used by prestigious conferences and journals to ensure the integrity of this process. Although the notion of anonymity in the double-blind review has been questioned before, the availability of full text paper collections brings new opportunities for exploring the question: Is the double-blind review process really double-blind? We study this question on the ACL and EMNLP paper collections and present an analysis on how well deep learning techniques can infer the authors of a paper. Specifically, we explore Convolutional Neural Networks trained on various aspects of a paper, e.g., content, style features, and references, to understand the extent to which we can infer the authors of a paper and what aspects contribute the most. Our results show that the authors of a paper can be inferred with accuracy as high as 87% on ACL and 78% on EMNLP for the top 100 most prolific authors.

Multi-Task Stance Detection with Sentiment and Stance Lexicons
Yingjie Li | Cornelia Caragea
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Stance detection aims to detect whether the opinion holder is in support of or against a given target. Recent works show improvements in stance detection by using either the attention mechanism or sentiment information. In this paper, we propose a multi-task framework that incorporates a target-specific attention mechanism and at the same time takes sentiment classification as an auxiliary task. Moreover, we use a sentiment lexicon and construct a stance lexicon to provide guidance for the attention layer. Experimental results show that the proposed model significantly outperforms state-of-the-art deep learning methods on the SemEval-2016 dataset.


Exploring Optimism and Pessimism in Twitter Using Deep Learning
Cornelia Caragea | Liviu P. Dinu | Bogdan Dumitru
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Identifying optimistic and pessimistic viewpoints and users from Twitter is useful for providing better social support to those who need such support, and for minimizing the negative influence among users and maximizing the spread of positive attitudes and ideas. In this paper, we explore a range of deep learning models to predict optimism and pessimism in Twitter at both tweet and user level and show that these models substantially outperform traditional machine learning classifiers used in prior work. In addition, we show evidence that a sentiment classifier would not be sufficient for accurately predicting optimism and pessimism in Twitter. Last, we study the verb tense usage as well as the presence of polarity words in optimistic and pessimistic tweets.

Fine-Grained Emotion Detection in Health-Related Online Posts
Hamed Khanpour | Cornelia Caragea
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Detecting fine-grained emotions in online health communities provides insightful information about patients’ emotional states. However, current computational approaches to emotion detection from health-related posts focus only on identifying messages that contain emotions, with no emphasis on the emotion type, using a set of handcrafted features. In this paper, we take a step further and propose to detect fine-grained emotion types from health-related posts and show how high-level and abstract features derived from deep neural networks combined with lexicon-based features can be employed to detect emotions.


PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents
Corina Florescu | Cornelia Caragea
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The large and growing amounts of online scholarly data present both challenges and opportunities to enhance knowledge discovery. One such challenge is to automatically extract a small set of keyphrases from a document that can accurately describe the document’s content and can facilitate fast information processing. In this paper, we propose PositionRank, an unsupervised model for keyphrase extraction from scholarly documents that incorporates information from all positions of a word’s occurrences into a biased PageRank. Our model obtains remarkable improvements in performance over PageRank models that do not take into account word positions as well as over strong baselines for this task. Specifically, on several datasets of research papers, PositionRank achieves improvements as high as 29.09%.
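
The idea of folding word positions into a biased PageRank can be illustrated with a small sketch. It is not the authors' implementation: it assumes each word's bias is the normalised sum of its inverse positions, that edges count co-occurrences within a fixed window, and that the damping factor is 0.85. It also skips the part-of-speech filtering and phrase formation that a full keyphrase extractor would need. The name `position_rank` and all parameter defaults are hypothetical.

```python
from collections import defaultdict


def position_rank(words, window=2, alpha=0.85, iterations=50):
    """Score words with a PageRank whose restart distribution favours early words."""
    # Position bias: sum of 1/position over every occurrence (1-based), normalised.
    bias = defaultdict(float)
    for pos, word in enumerate(words, start=1):
        bias[word] += 1.0 / pos
    total = sum(bias.values())
    bias = {word: b / total for word, b in bias.items()}

    # Undirected co-occurrence graph: words fewer than `window` tokens apart
    # are linked, and the edge weight counts how often that happens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            other = words[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    # Power iteration: restart mass follows the position bias, not a uniform prior.
    scores = {word: 1.0 / len(bias) for word in bias}
    for _ in range(iterations):
        updated = {}
        for word in bias:
            incoming = sum(
                weights[nbr][word] / sum(weights[nbr].values()) * scores[nbr]
                for nbr in weights[word]
            )
            updated[word] = (1 - alpha) * bias[word] + alpha * incoming
        scores = updated
    return scores


tokens = "keyphrase extraction ranks words keyphrase graph".split()
ranking = sorted(position_rank(tokens).items(), key=lambda kv: -kv[1])
```

In this toy input, a word that occurs early and more than once receives a larger share of the restart mass than a word that appears only once near the end, which is the effect the abstract attributes to using all positions of a word's occurrences.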

Identifying Empathetic Messages in Online Health Communities
Hamed Khanpour | Cornelia Caragea | Prakhar Biyani
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Empathy captures one’s ability to relate to and understand others’ emotional states and experiences. Messages with empathetic content are considered one of the main advantages of joining online health communities due to their potential to improve people’s moods. Unfortunately, to date, no computational studies exist that automatically identify empathetic messages in online health communities. We propose a combination of Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) networks, and show that the proposed model outperforms each individual model (CNN and LSTM) as well as several baselines.


Supervised Keyphrase Extraction as Positive Unlabeled Learning
Lucas Sterckx | Cornelia Caragea | Thomas Demeester | Chris Develder
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing


Co-Training for Topic Classification of Scholarly Data
Cornelia Caragea | Florin Bulgarov | Rada Mihalcea
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction
Sujatha Das Gollapalli | Cornelia Caragea | Xiaoli Li | C. Lee Giles
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction


Identifying Emotional and Informational Support in Online Health Communities
Prakhar Biyani | Cornelia Caragea | Prasenjit Mitra | John Yen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach
Cornelia Caragea | Florin Adrian Bulgarov | Andreea Godea | Sujatha Das Gollapalli
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)


Thread Specific Features are Helpful for Identifying Subjectivity Orientation of Online Forum Threads
Prakhar Biyani | Sumit Bhatia | Cornelia Caragea | Prasenjit Mitra
Proceedings of COLING 2012