Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of .46 across our proposed taxonomy, leaving much room for improvement.
In this paper, we present Pártélet, a digitized Hungarian corpus of Communist propaganda texts. Pártélet was the official journal of the governing party during the Hungarian socialism from 1956 to 1989, hence it represents the direct political agitation and propaganda of the dictatorial system in question. The paper has a dual purpose: first, to present a general review of the corpus compilation process and the basic statistical data of the corpus, and second, to demonstrate through two case studies what the dataset can be used for. We show that our corpus provides a unique opportunity for conducting research on Hungarian propaganda discourse, as well as analyzing changes of this discourse over a 35-year period of time with computer-assisted methods.
In this study, we apply NLP methods to learn about the framing of the 2020 Democratic Presidential candidates in news media. We use both a lexicon-based approach and word embeddings to analyze how candidates are discussed in news sources with different political leanings. Our results show significant differences in the framing of candidates across the news sources along several dimensions, such as sentiment and agency, paving the way for a deeper investigation.
We provide an NLP framework to uncover four linguistic dimensions of political polarization in social media: topic choice, framing, affect and illocutionary force. We quantify these aspects with existing lexical methods, and propose clustering of tweet embeddings as a means to identify salient topics for analysis across events; human evaluations show that our approach generates more cohesive topics than traditional LDA-based models. We apply our methods to study 4.4M tweets on 21 mass shootings. We provide evidence that the discussion of these events is highly polarized politically and that this polarization is primarily driven by partisan differences in framing rather than topic choice. We identify framing devices, such as grounding and the contrasting use of the terms “terrorist” and “crazy”, that contribute to polarization. Results pertaining to topic choice, affect and illocutionary force suggest that Republicans focus more on the shooter and event-specific facts (news) while Democrats focus more on the victims and call for policy changes. Our work contributes to a deeper understanding of the way group divisions manifest in language and to computational methods for studying them.