Desmond Elliott


2020

pdf bib
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations
Mostafa Abdou | Vinit Ravishankar | Maria Barrett | Yonatan Belinkov | Desmond Elliott | Anders Søgaard
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Large-scale pretrained language models are the major driving force behind recent improvements in perfromance on the Winograd Schema Challenge, a widely employed test of commonsense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones.

pdf bib
CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning
Alessandro Suglia | Ioannis Konstas | Andrea Vanzo | Emanuele Bastianelli | Desmond Elliott | Stella Frank | Oliver Lemon
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Approaches to Grounded Language Learning are commonly focused on a single task-based final performance measure which may not depend on desirable properties of the learned hidden representations, such as their ability to predict object attributes or generalize to unseen situations. To remedy this, we present GroLLA, an evaluation framework for Grounded Language Learning with Attributes based on three sub-tasks: 1) Goal-oriented evaluation; 2) Object attribute prediction evaluation; and 3) Zero-shot evaluation. We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations, in particular with respect to attribute grounding. To this end, we extend the original GuessWhat?! dataset by including a semantic layer on top of the perceptual one. Specifically, we enrich the VisualGenome scene graphs associated with the GuessWhat?! images with several attributes from resources such as VISA and ImSitu. We then compare several hidden state representations from current state-of-the-art approaches to Grounded Language Learning. By using diagnostic classifiers, we show that current models’ learned representations are not expressive enough to encode object attributes (average F1 of 44.27). In addition, they do not learn strategies nor representations that are robust enough to perform well when novel scenes or objects are involved in gameplay (zero-shot best accuracy 50.06%).

pdf bib
On Forgetting to Cite Older Papers: An Analysis of the ACL Anthology
Marcel Bollmann | Desmond Elliott
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The field of natural language processing is experiencing a period of unprecedented growth, and with it a surge of published papers. This represents an opportunity for us to take stock of how we cite the work of other researchers, and whether this growth comes at the expense of “forgetting” about older literature. In this paper, we address this question through bibliographic analysis. By looking at the age of outgoing citations in papers published at selected ACL venues between 2010 and 2019, we find that there is indeed a tendency for recent papers to cite more recent work, but the rate at which papers older than 15 years are cited has remained relatively stable.

pdf bib
Fine-Grained Grounding for Multimodal Speech Recognition
Tejas Srinivasan | Ramon Sanabria | Florian Metze | Desmond Elliott
Findings of the Association for Computational Linguistics: EMNLP 2020

Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model’s ability to localize the correct proposals.

pdf bib
Textual Supervision for Visually Grounded Spoken Language Understanding
Bertrand Higy | Desmond Elliott | Grzegorz Chrupała
Findings of the Association for Computational Linguistics: EMNLP 2020

Visually-grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions but more data is needed to obtain similar results.

pdf bib
Multimodal Speech Recognition with Unstructured Audio Masking
Tejas Srinivasan | Ramon Sanabria | Florian Metze | Desmond Elliott
Proceedings of the First International Workshop on Natural Language Processing Beyond Text

Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words are systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.

2019

pdf bib
Adversarial Removal of Demographic Attributes Revisited
Maria Barrett | Yova Kementchedjhieva | Yanai Elazar | Desmond Elliott | Anders Søgaard
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Elazar and Goldberg (2018) showed that protected attributes can be extracted from the representations of a debiased neural network for mention detection at above-chance levels, by evaluating a diagnostic classifier on a held-out subsample of the data it was trained on. We revisit their experiments and conduct a series of follow-up experiments showing that, in fact, the diagnostic classifier generalizes poorly to both new in-domain samples and new domains, indicating that it relies on correlations specific to their particular data sample. We further show that a diagnostic classifier trained on the biased baseline neural network also does not generalize to new samples. In other words, the biases detected in Elazar and Goldberg (2018) seem restricted to their particular data sample, and would therefore not bias the decisions of the model on new samples, whether in-domain or out-of-domain. In light of this, we discuss better methodologies for detecting bias in our models.

pdf bib
Understanding the Effect of Textual Adversaries in Multimodal Machine Translation
Koel Dutta Chowdhury | Desmond Elliott
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

It is assumed that multimodal machine translation systems are better than text-only systems at translating phrases that have a direct correspondence in the image. This assumption has been challenged in experiments demonstrating that state-of-the-art multimodal systems perform equally well in the presence of randomly selected images, but, more recently, it has been shown that masking entities from the source language sentence during training can help to overcome this problem. In this paper, we conduct experiments with both visual and textual adversaries in order to understand the role of incorrect textual inputs to such systems. Our results show that when the source language sentence contains mistakes, multimodal translation systems do not leverage the additional visual signal to produce the correct translation. We also find that the degradation of translation performance caused by textual adversaries is significantly higher than by visual adversaries.

pdf bib
Compositional Generalization in Image Captioning
Mitja Nikolaus | Mostafa Abdou | Matthew Lamm | Rahul Aralikatte | Desmond Elliott
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. We propose a multi-task model to address the poor performance, that combines caption generation and image–sentence ranking, and uses a decoding mechanism that re-ranks the captions according their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.

pdf bib
Cross-lingual Visual Verb Sense Disambiguation
Spandana Gella | Desmond Elliott | Frank Keller
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recent work has shown that visual context improves cross-lingual sense disambiguation for nouns. We extend this line of work to the more challenging task of cross-lingual verb sense disambiguation, introducing the MultiSense dataset of 9,504 images annotated with English, German, and Spanish verbs. Each image in MultiSense is annotated with an English verb and its translation in German or Spanish. We show that cross-lingual verb sense disambiguation models benefit from visual context, compared to unimodal baselines. We also show that the verb sense predicted by our best disambiguation model can improve the results of a text-only machine translation system when used for a multimodal translation task.

2018

pdf bib
Lessons Learned in Multilingual Grounded Language Learning
Ákos Kádár | Desmond Elliott | Marc-Alexandre Côté | Grzegorz Chrupała | Afra Alishahi
Proceedings of the 22nd Conference on Computational Natural Language Learning

Recent work has shown how to learn better visual-semantic embeddings by leveraging image descriptions in more than one language. Here, we investigate in detail which conditions affect the performance of this type of grounded language learning model. We show that multilingual training improves over bilingual training, and that low-resource languages benefit from training with higher-resource languages. We demonstrate that a multilingual model can be trained equally well on either translations or comparable sentence pairs, and that annotating the same set of images in multiple language enables further improvements via an additional caption-caption ranking objective.

pdf bib
Adversarial Evaluation of Multimodal Machine Translation
Desmond Elliott
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The promise of combining language and vision in multimodal machine translation is that systems will produce better translations by leveraging the image data. However, the evidence surrounding whether the images are useful is unconvincing due to inconsistencies between text-similarity metrics and human judgements. We present an adversarial evaluation to directly examine the utility of the image data in this task. Our evaluation tests whether systems perform better when paired with congruent images or incongruent images. This evaluation shows that only one out of three publicly available systems is sensitive to this perturbation of the data. We recommend that multimodal translation systems should be able to pass this sanity check in the future.

pdf bib
Measuring the Diversity of Automatic Image Descriptions
Emiel van Miltenburg | Desmond Elliott | Piek Vossen
Proceedings of the 27th International Conference on Computational Linguistics

Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them. In this paper, we consider the production of generic descriptions as a lack of diversity in the output, which we quantify using established metrics and two new metrics that frame image description as a word recall task. This framing allows us to evaluate system performance on the head of the vocabulary, as well as on the long tail, where system performance degrades. We use these metrics to examine the diversity of the sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that the systems trained with maximum likelihood objectives produce less diverse output than those trained with additional adversarial objectives. However, the adversarially-trained models only produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based methods, we also look at the compositional capacity of the systems, specifically their ability to create compound nouns and prepositional phrases of different lengths. We conclude that there is still much room for improvement, and offer a toolkit to measure progress towards the goal of generating more diverse image descriptions.

pdf bib
Findings of the Third Shared Task on Multimodal Machine Translation
Loïc Barrault | Fethi Bougares | Lucia Specia | Chiraag Lala | Desmond Elliott | Stella Frank
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present the results from the third shared task on multimodal machine translation. In this task a source sentence in English is supplemented by an image and participating systems are required to generate a translation for such a sentence into German, French or Czech. The image can be used in addition to (or instead of) the source sentence. This year the task was extended with a third target language (Czech) and a new test set. In addition, a variant of this task was introduced with its own test set where the source sentence is given in multiple languages: English, French and German, and participating systems are required to generate a translation in Czech. Seven teams submitted 45 different systems to the two variants of the task. Compared to last year, the performance of the multimodal submissions improved, but text-only systems remain competitive.

pdf bib
Talking about other people: an endless range of possibilities
Emiel van Miltenburg | Desmond Elliott | Piek Vossen
Proceedings of the 11th International Conference on Natural Language Generation

Image description datasets, such as Flickr30K and MS COCO, show a high degree of variation in the ways that crowd-workers talk about the world. Although this gives us a rich and diverse collection of data to work with, it also introduces uncertainty about how the world should be described. This paper shows the extent of this uncertainty in the PEOPLE-domain. We present a taxonomy of different ways to talk about other people. This taxonomy serves as a reference point to think about how other people should be described, and can be used to classify and compute statistics about labels applied to people.

2017

pdf bib
Imagination Improves Multimodal Translation
Desmond Elliott | Ákos Kádár
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.

pdf bib
Cross-linguistic differences and similarities in image descriptions
Emiel van Miltenburg | Desmond Elliott | Piek Vossen
Proceedings of the 10th International Conference on Natural Language Generation

Automatic image description systems are commonly trained and evaluated on large image description datasets. Recently, researchers have started to collect such datasets for languages other than English. An unexplored question is how different these datasets are from English and, if there are any differences, what causes them to differ. This paper provides a cross-linguistic comparison of Dutch, English, and German image descriptions. We find that these descriptions are similar in many respects, but the familiarity of crowd workers with the subjects of the images has a noticeable influence on the specificity of the descriptions.

pdf bib
Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description
Desmond Elliott | Stella Frank | Loïc Barrault | Fethi Bougares | Lucia Specia
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
A Corpus of Images and Text in Online News
Laura Hollink | Adriatik Bedjeti | Martin van Harmelen | Desmond Elliott
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In recent years, several datasets have been released that include images and text, giving impulse to new methods that combine natural language processing and computer vision. However, there is a need for datasets of images in their natural textual context. The ION corpus contains 300K news articles published between August 2014 - 2015 in five online newspapers from two countries. The 1-year coverage over multiple publishers ensures a broad scope in terms of topics, image quality and editorial viewpoints. The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher’s website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image. Neither the article text nor the images themselves are included in the corpus. Instead, the images are distributed as high-dimensional feature vectors extracted from a Convolutional Neural Network, anticipating their use in computer vision tasks. The article text is represented as a list of automatically generated entity and topic annotations in the form of Wikipedia/DBpedia pages. This facilitates the selection of subsets of the corpus for separate analysis or evaluation.

pdf bib
1 Million Captioned Dutch Newspaper Images
Desmond Elliott | Martijn Kleppe
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Images naturally appear alongside text in a wide variety of media, such as books, magazines, newspapers, and in online articles. This type of multi-modal data offers an interesting basis for vision and language research but most existing datasets use crowdsourced text, which removes the images from their original context. In this paper, we introduce the KBK-1M dataset of 1.6 million images in their original context, with co-occurring texts found in Dutch newspapers from 1922 - 1994. The images are digitally scanned photographs, cartoons, sketches, and weather forecasts; the text is generated from OCR scanned blocks. The dataset is suitable for experiments in automatic image captioning, image―article matching, object recognition, and data-to-text generation for weather forecasting. It can also be used by humanities scholars to analyse photographic style changes, the representation of people and societal issues, and new tools for exploring photograph reuse via image-similarity-based search.

bib
Multimodal Learning and Reasoning
Desmond Elliott | Douwe Kiela | Angeliki Lazaridou
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Natural Language Processing has broadened in scope to tackle more and more challenging language understanding and reasoning tasks. The core NLP tasks remain predominantly unimodal, focusing on linguistic input, despite the fact that we, humans, acquire and use language while communicating in perceptually rich environments. Moving towards human-level AI will require the integration and modeling of multiple modalities beyond language. With this tutorial, our aim is to introduce researchers to the areas of NLP that have dealt with multimodal signals. The key advantage of using multimodal signals in NLP tasks is the complementarity of the data in different modalities. For example, we are less likely to nd descriptions of yellow bananas or wooden chairs in text corpora, but these visual attributes can be readily extracted directly from images. Multimodal signals, such as visual, auditory or olfactory data, have proven useful for models of word similarity and relatedness, automatic image and video description, and even predicting the associated smells of words. Finally, multimodality offers a practical opportunity to study and apply multitask learning, a general machine learning paradigm that improves generalization performance of a task by using training signals of other related tasks.All material associated to the tutorial will be available at http://multimodalnlp.github.io/

pdf bib
A Shared Task on Multimodal Machine Translation and Crosslingual Image Description
Lucia Specia | Stella Frank | Khalil Sima’an | Desmond Elliott
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
DCU-UvA Multimodal MT System Report
Iacer Calixto | Desmond Elliott | Stella Frank
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Pragmatic Factors in Image Description: The Case of Negations
Emiel van Miltenburg | Roser Morante | Desmond Elliott
Proceedings of the 5th Workshop on Vision and Language

pdf bib
Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott | Stella Frank | Khalil Sima’an | Lucia Specia
Proceedings of the 5th Workshop on Vision and Language

2015

pdf bib
Describing Images using Inferred Visual Dependency Representations
Desmond Elliott | Arjen de Vries
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
Comparing Automatic Evaluation Measures for Image Description
Desmond Elliott | Frank Keller
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics
Shuly Wintner | Desmond Elliott | Konstantina Garoufi | Douwe Kiela | Ivan Vulić
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Towards Succinct and Relevant Image Descriptions
Desmond Elliott
Proceedings of the Third Workshop on Vision and Language

pdf bib
Query-by-Example Image Retrieval using Visual Dependency Representations
Desmond Elliott | Victor Lavrenko | Frank Keller
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Image Description using Visual Dependency Representations
Desmond Elliott | Frank Keller
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing