Jonathan Dunn


2020

Geographically-Balanced Gigaword Corpora for 50 Language Varieties
Jonathan Dunn | Ben Adams
Proceedings of the 12th Language Resources and Evaluation Conference

While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics. For example, recent work has shown that existing English gigaword corpora over-represent inner-circle varieties from the US and the UK. To correct implicit geographic and demographic biases, this paper uses country-level population demographics to guide the construction of gigaword web corpora. The resulting corpora explicitly match the ground-truth geographic distribution of each language, thus equally representing language users from around the world. This is important because it ensures that speakers of under-resourced language varieties (e.g., Indian English or Algerian French) are represented, both in the corpora themselves and in derivative resources like word embeddings.
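
To make the idea of population-guided corpus construction concrete, the following is a minimal sketch, not the paper's actual procedure. It assumes a simple proportional allocation: each country receives a share of the corpus equal to its share of the language's users. The function name `allocate_words` and the speaker counts are hypothetical and chosen only for illustration.

```python
def allocate_words(speakers_by_country, corpus_size):
    """Split a word budget across countries in proportion to speaker counts.

    Assumption: plain proportional allocation with largest-remainder
    rounding, so that the quotas sum exactly to corpus_size.
    """
    total = sum(speakers_by_country.values())
    if total <= 0:
        raise ValueError("speaker counts must sum to a positive number")

    # Exact (fractional) share of the budget for each country.
    exact = {c: corpus_size * n / total for c, n in speakers_by_country.items()}

    # Round down, then hand leftover words to the largest remainders.
    quotas = {c: int(v) for c, v in exact.items()}
    leftover = corpus_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for country in by_remainder[:leftover]:
        quotas[country] += 1
    return quotas


# Hypothetical speaker counts (in millions) for one language.
example = allocate_words({"A": 3, "B": 1}, corpus_size=1000)
```

Under this assumption, a country with three times as many speakers receives three times as many words, regardless of how much web text it happens to produce.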

Measuring Linguistic Diversity During COVID-19
Jonathan Dunn | Tom Coupe | Benjamin Adams
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-referenced social media and web data. The goal, however, has been to describe these corpora themselves rather than to make inferences about underlying populations. This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora that is introduced by non-local populations. These methods tell us where significant changes have taken place and whether those changes lead to increased or decreased diversity. This is an important step in aligning digital corpora like social media with the real-world populations that have produced them.
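
As a rough illustration of the two ingredients named in the abstract, the sketch below computes the Herfindahl-Hirschman Index (the sum of squared shares) over language counts and combines four such values in a basic difference-in-differences estimate. It is not the paper's exact specification: the function names, the choice of treated and control groups, and all counts are hypothetical.

```python
def hhi(language_counts):
    """Herfindahl-Hirschman Index: the sum of squared language shares.

    Values near 1.0 mean one language dominates (low diversity);
    lower values mean usage is spread over more languages.
    """
    total = sum(language_counts.values())
    if total == 0:
        raise ValueError("no observations")
    return sum((n / total) ** 2 for n in language_counts.values())


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)


# Hypothetical message counts per language for one location.
treated_before = {"eng": 60, "deu": 20, "fra": 20}  # before travel restrictions
treated_after = {"eng": 80, "deu": 10, "fra": 10}   # during travel restrictions
# Hypothetical comparison periods with no restrictions in force.
control_before = {"eng": 50, "spa": 50}
control_after = {"eng": 50, "spa": 50}

effect = diff_in_diff(
    hhi(treated_before), hhi(treated_after),
    hhi(control_before), hhi(control_after),
)
```

In this toy setup a positive `effect` means the index rose more in the treated group than in the control group, that is, language use became more concentrated and therefore less diverse.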

2019

Modeling Global Syntactic Variation in English Using Dialect Classification
Jonathan Dunn
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper evaluates global-scale dialect identification for 14 national varieties of English on both web-crawled data and Twitter data. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.

Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar
Jonathan Dunn
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The experiments show that association-based models produce better generalizations across all languages by a significant margin.

2018

Modeling the Complexity and Descriptive Adequacy of Construction Grammars
Jonathan Dunn
Proceedings of the Society for Computation in Linguistics (SCiL) 2018

2014

Measuring metaphoricity
Jonathan Dunn
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Multi-dimensional abstractness in cross-domain mappings
Jonathan Dunn
Proceedings of the Second Workshop on Metaphor in NLP

2013

What metaphor identification systems can tell us about metaphor-in-language
Jonathan Dunn
Proceedings of the First Workshop on Metaphor in NLP