Lawrence Wolf-Sonkin


2020

pdf bib
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
Brian Roark | Lawrence Wolf-Sonkin | Christo Kirov | Sabrina J. Mielke | Cibu Johny | Isin Demirsahin | Keith Hall
Proceedings of the 12th Language Resources and Evaluation Conference

This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet. We document the methods used for preparation and selection of the Wikipedia text in each language; collection of attested romanizations for sampled lexicons; and manual romanization of held-out sentences from the native script collections. We additionally provide baseline results on several tasks made possible by the dataset, including single word transliteration, full sentence transliteration, and language modeling of native script and romanized text.

2019

pdf bib
Quantifying the Semantic Core of Gender Systems
Adina Williams | Damian Blasi | Lawrence Wolf-Sonkin | Hanna Wallach | Ryan Cotterell
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many of the world’s languages employ grammatical gender on the lexeme. For instance, in Spanish, house “casa” is feminine, whereas the word for paper “papel” is masculine. To a speaker of a genderless language, this categorization seems to exist with neither rhyme nor reason. But, is the association of nouns to gender classes truly arbitrary? In this work, we present the first large-scale investigation of the arbitrariness of gender assignment that uses canonical correlation analysis as a method for correlating the gender of inanimate nouns with their lexical semantic meaning. We find that the gender systems of 18 languages exhibit a significant correlation with an externally grounded definition of lexical semantics.

pdf bib
Unsupervised Discovery of Gendered Language through Latent-Variable Modeling
Alexander Miserlis Hoyle | Lawrence Wolf-Sonkin | Hanna Wallach | Isabelle Augenstein | Ryan Cotterell
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Studying the ways in which language is gendered has long been an area of interest in sociolinguistics. Studies have explored, for example, the speech of male and female characters in film and the language used to describe male and female politicians. In this paper, we aim not to merely study this phenomenon qualitatively, but instead to quantify the degree to which the language used to describe men and women is different and, moreover, different in a positive or negative way. To that end, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We find that there are significant differences between descriptions of male and female nouns and that these differences align with common gender stereotypes: Positive adjectives used to describe women are more often related to their bodies than adjectives used to describe men.

pdf bib
On the Distribution of Deep Clausal Embeddings: A Large Cross-linguistic Study
Damian Blasi | Ryan Cotterell | Lawrence Wolf-Sonkin | Sabine Stoll | Balthasar Bickel | Marco Baroni
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Embedding a clause inside another (“the girl [who likes cars [that run fast]] has arrived”) is a fundamental resource that has been argued to be a key driver of linguistic expressiveness. As such, it plays a central role in fundamental debates on what makes human language unique, and how they might have evolved. Empirical evidence on the prevalence and the limits of embeddings has however been based on either laboratory setups or corpus data of relatively limited size. We introduce here a collection of large, dependency-parsed written corpora in 17 languages, that allow us, for the first time, to capture clausal embedding through dependency graphs and assess their distribution. Our results indicate that there is no evidence for hard constraints on embedding depth: the tail of depth distributions is heavy. Moreover, although deeply embedded clauses tend to be shorter, suggesting processing load issues, complex sentences with many embeddings do not display a bias towards less deep embeddings. Taken together, the results suggest that deep embeddings are not disfavoured in written language. More generally, our study illustrates how resources and methods from latest-generation big-data NLP can provide new perspectives on fundamental questions in theoretical linguistics.

pdf bib
Latin script keyboards for South Asian languages with finite-state normalization
Lawrence Wolf-Sonkin | Vlad Schogol | Brian Roark | Michael Riley
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

The use of the Latin script for text entry of South Asian languages is common, even though there is no standard orthography for these languages in the script. We explore several compact finite-state architectures that permit variable spellings of words during mobile text entry. We find that approaches making use of transliteration transducers provide large accuracy improvements over baselines, but that simpler approaches involving a compact representation of many attested alternatives yields much of the accuracy gain. This is particularly important when operating under constraints on model size (e.g., on inexpensive mobile devices with limited storage and memory for keyboard models), and on speed of inference, since people typing on mobile keyboards expect no perceptual delay in keyboard responsiveness.

pdf bib
The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
Arya D. McCarthy | Ekaterina Vylomova | Shijie Wu | Chaitanya Malaviya | Lawrence Wolf-Sonkin | Garrett Nicolai | Christo Kirov | Miikka Silfverberg | Sabrina J. Mielke | Jeffrey Heinz | Ryan Cotterell | Mans Hulden
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years’ inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year’s strong baselines or highly ranked systems from previous years’ shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.

pdf bib
Combining Sentiment Lexica with a Multi-View Variational Autoencoder
Alexander Miserlis Hoyle | Lawrence Wolf-Sonkin | Hanna Wallach | Ryan Cotterell | Isabelle Augenstein
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

When assigning quantitative labels to a dataset, different methodologies may rely on different scales. In particular, when assigning polarities to words in a sentiment lexicon, annotators may use binary, categorical, or continuous labels. Naturally, it is of interest to unify these labels from disparate scales to both achieve maximal coverage over words and to create a single, more robust sentiment lexicon while retaining scale coherence. We introduce a generative model of sentiment lexica to combine disparate scales into a common latent representation. We realize this model with a novel multi-view variational autoencoder (VAE), called SentiVAE. We evaluate our approach via a downstream text classification task involving nine English-Language sentiment analysis datasets; our representation outperforms six individual sentiment lexica, as well as a straightforward combination thereof.

2018

pdf bib
A Structured Variational Autoencoder for Contextual Morphological Inflection
Lawrence Wolf-Sonkin | Jason Naradowsky | Sabrina J. Mielke | Ryan Cotterell
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Statistical morphological inflectors are typically trained on fully supervised, type-level data. One remaining open research question is the following: How can we effectively exploit raw, token-level data to improve their performance? To this end, we introduce a novel generative latent-variable model for the semi-supervised learning of inflection generation. To enable posterior inference over the latent variables, we derive an efficient variational inference procedure based on the wake-sleep algorithm. We experiment on 23 languages, using the Universal Dependencies corpora in a simulated low-resource setting, and find improvements of over 10% absolute accuracy in some cases.