Josiah Wang


2019

VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions
Pranava Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The metric is also able to take into account the relative importance of objects mentioned in human reference descriptions during evaluation. Even if these human reference descriptions are not available, VIFIDEL can still reliably evaluate system descriptions. The metric achieves high correlation with human judgments on two well-known datasets and is competitive with metrics that rely exclusively on human references.
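As a rough illustration of scoring a caption by the semantic similarity between object labels and description words, the sketch below uses greedy label-to-word cosine matching. This is a simplified stand-in, not the metric's actual formulation: the function names, the matching rule, and the tiny hand-made embedding table are all hypothetical, and a real setup would use pretrained word vectors and detector output.

```python
import math

# Hypothetical toy embeddings, invented purely for illustration.
TOY_EMBEDDINGS = {
    "dog": (1.0, 0.0, 0.0),
    "puppy": (0.9, 0.1, 0.0),
    "ball": (0.0, 1.0, 0.0),
    "car": (0.0, 0.0, 1.0),
    "road": (0.0, 0.1, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_word_similarity(object_labels, description, embeddings=TOY_EMBEDDINGS):
    """Average, over object labels, of the best cosine match to any description word.

    Words and labels missing from the embedding table are ignored
    (an assumption of this sketch).
    """
    words = [w for w in description.lower().split() if w in embeddings]
    labels = [label for label in object_labels if label in embeddings]
    if not words or not labels:
        return 0.0
    best_matches = [
        max(cosine(embeddings[label], embeddings[word]) for word in words)
        for label in labels
    ]
    return sum(best_matches) / len(labels)


labels_in_image = ["dog", "ball"]
faithful = label_word_similarity(labels_in_image, "a puppy with a ball")
unfaithful = label_word_similarity(labels_in_image, "a car on a road")
```

Under these toy vectors, the caption mentioning the depicted objects scores higher than the unrelated one, and no human reference description is needed to compute either score.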

2018

Object Counts! Bringing Explicit Detections Back into Image Captioning
Josiah Wang | Pranava Swaroop Madhyastha | Lucia Specia
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The use of explicit object detectors as an intermediate step to image captioning – which used to constitute an essential stage in early work – is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that the frequency, size and position of objects are complementary cues that all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.

Defoiling Foiled Image Captions
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We address the task of detecting foiled image captions, i.e. identifying whether a caption contains a word that has been deliberately replaced by a semantically similar word, thus rendering it inaccurate with respect to the image being described. Solving this problem should in principle require a fine-grained understanding of images to detect subtle perturbations in captions. In such contexts, encoding sufficiently descriptive image information becomes a key challenge. In this paper, we demonstrate that it is possible to solve this task using simple, interpretable yet powerful representations based on explicit object information over multilayer perceptron models. Our models achieve state-of-the-art performance on a recently published dataset, with scores exceeding those achieved by humans on the task. We also measure the upper-bound performance of our models using gold standard annotations. Our study and analysis reveal that the simpler model performs well even without image information, suggesting that the dataset contains strong linguistic bias.

End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn ‘distributional similarity’ in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the ‘image’ side of image captioning, and vary the input image representation but keep the RNN text generation model of a CNN-RNN constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) experience virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our experiments all point to one fact: that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.

2017

Sheffield MultiMT: Using Object Posterior Predictions for Multimodal Machine Translation
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the Second Conference on Machine Translation

2016

Cross-validating Image Description Datasets and Evaluation Metrics
Josiah Wang | Robert Gaizauskas
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems. However, not much work has been done to analyse and understand both the datasets and the metrics. In this paper, we propose using a leave-one-out cross validation (LOOCV) process as a means to analyse multiply annotated, human-authored image description datasets and the various evaluation metrics, i.e. evaluating one image description against other human-authored descriptions of the same image. Such an evaluation process affords various insights into the image description datasets and evaluation metrics, such as the variations of image descriptions within and across datasets and also what the metrics capture. We compute and analyse (i) human upper-bound performance; (ii) ranked correlation between metric pairs across datasets; (iii) lower-bound performance by comparing a set of descriptions describing one image to another sentence not describing that image. Interesting observations are made about the evaluation metrics and image description datasets, and we conclude that such cross-validation methods are extremely useful for assessing and gaining insights into image description datasets and evaluation metrics for image descriptions.
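The leave-one-out procedure described in this abstract can be sketched as follows: each human description of an image is held out in turn and scored against the remaining descriptions of the same image. The metric here is a toy unigram-overlap F1 chosen only to keep the example self-contained; it is an assumption of this sketch, not one of the metrics analysed in the paper, and the function names and sample sentences are hypothetical.

```python
from collections import Counter
from statistics import mean


def unigram_f1(candidate, references):
    """Toy stand-in metric: best unigram-overlap F1 against any reference."""
    cand_counts = Counter(candidate.lower().split())
    best = 0.0
    for reference in references:
        ref_counts = Counter(reference.lower().split())
        # Multiset intersection counts the words shared by both sentences.
        overlap = sum((cand_counts & ref_counts).values())
        if overlap == 0:
            continue
        precision = overlap / sum(cand_counts.values())
        recall = overlap / sum(ref_counts.values())
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def loocv_scores(descriptions, metric=unigram_f1):
    """Score each description against all other descriptions of the same image."""
    if len(descriptions) < 2:
        raise ValueError("need at least two descriptions per image")
    return [
        metric(held_out, descriptions[:i] + descriptions[i + 1:])
        for i, held_out in enumerate(descriptions)
    ]


# Hypothetical annotations for a single image.
descriptions = [
    "a dog runs on the grass",
    "a dog runs on the grass",
    "two children play football",
]
scores = loocv_scores(descriptions)
# Averaging the held-out scores gives one simple estimate of how well
# human annotators agree with each other under the chosen metric.
average_agreement = mean(scores)
```

Any sentence-level metric with the same `(candidate, references)` signature could be passed as `metric`, which makes it straightforward to compare how different metrics rank the same held-out descriptions.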

SHEF-Multimodal: Grounding Machine Translation on Images
Kashif Shah | Josiah Wang | Lucia Specia
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Don’t Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation
Josiah Wang | Robert Gaizauskas
Proceedings of the 9th International Natural Language Generation conference

2015

Combining Geometric, Textual and Visual Features for Predicting Prepositions in Image Descriptions
Arnau Ramisa | Josiah Wang | Ying Lu | Emmanuel Dellandrea | Francesc Moreno-Noguer | Robert Gaizauskas
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Defining Visually Descriptive Language
Robert Gaizauskas | Josiah Wang | Arnau Ramisa
Proceedings of the Fourth Workshop on Vision and Language

Generating Image Descriptions with Gold Standard Visual Inputs: Motivation, Evaluation and Baselines
Josiah Wang | Robert Gaizauskas
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2014

A Poodle or a Dog? Evaluating Automatic Image Annotation Using Human Descriptions at Different Levels of Granularity
Josiah Wang | Fei Yan | Ahmet Aker | Robert Gaizauskas
Proceedings of the Third Workshop on Vision and Language