Lieke Gelderloos


2020

Learning to Understand Child-directed and Adult-directed Speech
Lieke Gelderloos | Grzegorz Chrupała | Afra Alishahi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Speech directed to children differs from adult-directed speech in linguistic aspects such as repetition, word choice, and sentence length, as well as in aspects of the speech signal itself, such as prosodic and phonemic variation. Human language acquisition research indicates that child-directed speech helps language learners. This study explores the effect of child-directed speech when learning to extract semantic information from speech directly. We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS). We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better. The results suggest that this is at least partially due to linguistic rather than acoustic properties of the two registers, as we see the same pattern when looking at models trained on acoustically comparable synthetic speech.

2019

The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue
Janosch Haber | Tim Baumgärtner | Ece Takmaz | Lieke Gelderloos | Elia Bruni | Raquel Fernández
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images utilising both their visual context and previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.

On the difficulty of a distributional semantics of spoken language
Grzegorz Chrupała | Lieke Gelderloos | Ákos Kádár | Afra Alishahi
Proceedings of the Society for Computation in Linguistics (SCiL) 2019

2017

Representations of language in a model of visually grounded speech signal
Grzegorz Chrupała | Lieke Gelderloos | Afra Alishahi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of speech, and show that it learns to extract both form- and meaning-based linguistic knowledge from the input signal. We carry out an in-depth analysis of the representations used by different components of the trained model and show that encoding of semantic aspects tends to become richer as we go up the hierarchy of layers, whereas encoding of form-related aspects of the language input tends to initially increase and then plateau or decrease.

2016

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning
Lieke Gelderloos | Grzegorz Chrupała
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present a model of visually-grounded language learning based on stacked gated recurrent neural networks which learns to predict visual features given an image description in the form of a sequence of phonemes. The learning task resembles that faced by human language learners who need to discover both structure and meaning from noisy and ambiguous data across modalities. We show that our model indeed learns to predict features of the visual context given phonetically transcribed image descriptions, and show that it represents linguistic information in a hierarchy of levels: lower layers in the stack are comparatively more sensitive to form, whereas higher layers are more sensitive to meaning.