Dong Nguyen


tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection
Nicole Peinelt | Dong Nguyen | Maria Liakata
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Semantic similarity detection is a fundamental task in natural language understanding. Adding topic information has been useful for previous feature-engineered semantic similarity models as well as neural models for other tasks. There is currently no standard way of combining topics with pretrained contextual representations such as BERT. We propose a novel topic-informed BERT-based architecture for pairwise semantic similarity detection and show that our model improves performance over strong neural baselines across a variety of English language datasets. We find that the addition of topics to BERT helps particularly with resolving domain-specific cases.

Do Word Embeddings Capture Spelling Variation?
Dong Nguyen | Jack Grieve
Proceedings of the 28th International Conference on Computational Linguistics

Analyses of word embeddings have primarily focused on semantic and syntactic properties. However, word embeddings have the potential to encode other properties as well. In this paper, we propose a new perspective on the analysis of word embeddings by focusing on spelling variation. In social media, spelling variation is abundant and often socially meaningful. Here, we analyze word embeddings trained on Twitter and Reddit data. We present three analyses using pairs of word forms covering seven types of spelling variation in English. Taken together, our results show that word embeddings encode spelling variation patterns of various types to some extent, even embeddings trained using the skipgram model which does not take spelling into account. Our results also suggest a link between the intentionality of the variation and the distance of the non-conventional spellings to their conventional spellings.


Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings
Philippa Shoemark | Farhana Ferdousi Liza | Dong Nguyen | Scott Hale | Barbara McGillivray
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Word embeddings are increasingly used for the automatic detection of semantic change; yet, a robust evaluation and systematic comparison of the choices involved has been lacking. We propose a new evaluation framework for semantic change detection and find that (i) using the whole time series is preferable over only comparing between the first and last time points; (ii) independently trained and aligned embeddings perform better than continuously trained embeddings for long time periods; and (iii) the reference point for comparison matters. We also present an analysis of the changes detected on a large Twitter dataset spanning 5.5 years.

Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
Nicole Peinelt | Maria Liakata | Dong Nguyen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Existing datasets for scoring text pairs in terms of semantic similarity contain instances whose resolution differs according to the degree of difficulty. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of containing difficult cases and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of similarity which require more complex inference and propose that these are used for evaluating systems for semantic similarity.

Challenges and frontiers in abusive content detection
Bertie Vidgen | Alex Harris | Dong Nguyen | Rebekah Tromble | Scott Hale | Helen Margetts
Proceedings of the Third Workshop on Abusive Language Online

Online abusive content detection is an inherently difficult task. It has received considerable attention from academia, particularly within the computational linguistics community, and performance appears to have improved as the field has matured. However, considerable challenges and unaddressed frontiers remain, spanning technical, social and ethical dimensions. These issues constrain the performance, efficiency and generalizability of abusive content detection systems. In this article we delineate and clarify the main challenges and frontiers in the field, critically evaluate their implications and discuss potential solutions. We also highlight ways in which social scientific insights can advance research. We discuss the lack of support given to researchers working with abusive content and provide guidelines for ethical research.


Comparing Automatic and Human Evaluation of Local Explanations for Text Classification
Dong Nguyen
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Text classification models are becoming increasingly complex and opaque; however, for many applications it is essential that the models are interpretable. Recently, a variety of approaches have been proposed for generating local explanations. While robust evaluations are needed to drive further progress, so far it is unclear which evaluation approaches are suitable. This paper is a first step towards more robust evaluations of local explanations. We evaluate a variety of local explanation approaches using automatic measures based on word deletion. Furthermore, we show that an evaluation using a crowdsourcing experiment correlates moderately with these automatic measures and that a variety of other factors also impact the human judgements.


A Kernel Independence Test for Geographical Language Variation
Dong Nguyen | Jacob Eisenstein
Computational Linguistics, Volume 43, Issue 3 - September 2017

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: Some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.


Automatic Detection of Intra-Word Code-Switching
Dong Nguyen | Leonie Cornips
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Survey: Computational Sociolinguistics: A Survey
Dong Nguyen | A. Seza Doğruöz | Carolyn P. Rosé | Franciska de Jong
Computational Linguistics, Volume 42, Issue 3 - September 2016


#SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns
Dong Nguyen | Tijs van den Broek | Claudia Hauff | Djoerd Hiemstra | Michel Ehrenhard
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

On the Impact of Twitter-based Health Campaigns: A Cross-Country Analysis of Movember
Nugroho Dwi Prasetyo | Claudia Hauff | Dong Nguyen | Tijs van den Broek | Djoerd Hiemstra
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis


Predicting Code-switching in Multilingual Communication for Immigrant Communities
Evangelos Papalexakis | Dong Nguyen | A. Seza Doğruöz
Proceedings of the First Workshop on Computational Approaches to Code Switching

Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
Dong Nguyen | Dolf Trieschnigg | A. Seza Doğruöz | Rilana Gravel | Mariët Theune | Theo Meder | Franciska de Jong
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

TweetGenie: Development, Evaluation, and Lessons Learned
Dong Nguyen | Dolf Trieschnigg | Theo Meder
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations


Word Level Language Identification in Online Multilingual Communication
Dong Nguyen | A. Seza Doğruöz
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Learning to Extract Folktale Keywords
Dolf Trieschnigg | Dong Nguyen | Mariët Theune
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities


Language use as a reflection of socialization in online communities
Dong Nguyen | Carolyn P. Rosé
Proceedings of the Workshop on Language in Social Media (LSM 2011)

Author Age Prediction from Text using Linear Regression
Dong Nguyen | Noah A. Smith | Carolyn P. Rosé
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities