The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System

Brandon Birmingham, Adrian Muscat


Abstract
In this paper, a retrieval-based caption generation system that searches the web for suitable image descriptions is studied. Google’s reverse image search is used to find potentially relevant multimedia web content for query images. Sentences are extracted from the retrieved web pages, and their likelihoods are computed to select a single description from the retrieved text documents. The search mechanism is modified so that the caption generated by Google is replaced with a caption composed of object labels and spatial prepositions, supplied as the query text alongside the image. The object labels are obtained using an off-the-shelf R-CNN, and a machine learning model is developed to predict the prepositions. The effect of this generated text on the caption generation system’s performance is investigated. Both human evaluations and automatic metrics are used to evaluate the retrieved descriptions. Results show that the web-retrieval-based approach performed better when describing single-object images with sentences extracted from stock photography websites, whereas images containing two objects were better described by template-generated sentences composed of object labels and prepositions.
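The template composition step described above can be sketched as follows. This is a minimal, illustrative reconstruction: the rule-based `predict_preposition` is a hypothetical stand-in for the machine-learning preposition model in the paper, and the box format and the decision rule are assumptions, not the authors' method.

```python
# Hedged sketch of composing query text from detected object labels and a
# spatial preposition. predict_preposition is a hypothetical rule-based
# stand-in for the paper's learned preposition model.

def predict_preposition(box_a, box_b):
    """Pick a spatial preposition for object A relative to object B.

    Boxes are (x, y, w, h) tuples with the origin at the top-left,
    an assumed convention for this sketch.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Compare the centre points of the two boxes.
    acx, acy = ax + aw / 2, ay + ah / 2
    bcx, bcy = bx + bw / 2, by + bh / 2
    if abs(acy - bcy) > abs(acx - bcx):   # vertical offset dominates
        return "above" if acy < bcy else "below"
    return "next to"                       # mostly horizontal arrangement


def compose_caption(labels, boxes):
    """Build template query text from one or two detected objects."""
    if len(labels) == 1:
        return f"a {labels[0]}"
    prep = predict_preposition(boxes[0], boxes[1])
    return f"a {labels[0]} {prep} a {labels[1]}"
```

For example, `compose_caption(["dog", "sofa"], [(50, 20, 40, 40), (30, 120, 100, 60)])` yields `"a dog above a sofa"`, which would then serve as the query text accompanying the image.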
Anthology ID:
W17-2002
Volume:
Proceedings of the Sixth Workshop on Vision and Language
Month:
April
Year:
2017
Address:
Valencia, Spain
Venues:
VL | WS
Publisher:
Association for Computational Linguistics
Pages:
11–20
URL:
https://www.aclweb.org/anthology/W17-2002
DOI:
10.18653/v1/W17-2002
PDF:
http://aclanthology.lst.uni-saarland.de/W17-2002.pdf