Grounding Semantic Roles in Images

Carina Silberer, Manfred Pinkal


Abstract
We address the task of visual semantic role labeling (vSRL): identifying the participants of a situation or event in a visual scene and labeling them with the semantic roles they bear with respect to that event or situation. We represent candidate participants as image regions of objects and train a model which learns to ground each role in the region that depicts the corresponding participant. Experimental results demonstrate that a vSRL model can be trained without reliance on prohibitively expensive image-based role annotations, by exploiting noisy data extracted automatically from image captions with a linguistic SRL system. Furthermore, our model induces frame-semantic visual representations, which yield overall better results than previous work on supervised visual verb sense disambiguation.
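To make the grounding step concrete, here is a minimal sketch of one plausible way to implement it; this is an illustration, not the paper's exact model. It assumes candidate regions come from an object detector with fixed CNN features, scores each region for a (verb, role) query by embedding similarity, and trains with a margin ranking loss on the noisy gold region derived from caption alignment (ranking losses tend to tolerate such label noise). All names, dimensions, and the loss choice are assumptions.

```python
# Hypothetical role-grounding scorer (an illustrative sketch, not the authors' model).
# A (verb, role) query is embedded and matched against projected region features;
# the noisily labeled gold region is trained to outscore all other candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoleGrounder(nn.Module):
    def __init__(self, n_verbs, n_roles, region_dim=2048, embed_dim=300):
        super().__init__()
        self.verb_emb = nn.Embedding(n_verbs, embed_dim)
        self.role_emb = nn.Embedding(n_roles, embed_dim)
        # Project region CNN features into the query embedding space.
        self.region_proj = nn.Linear(region_dim, embed_dim)

    def forward(self, verb_ids, role_ids, region_feats):
        # verb_ids, role_ids: (batch,); region_feats: (batch, n_regions, region_dim)
        query = self.verb_emb(verb_ids) + self.role_emb(role_ids)      # (batch, embed_dim)
        regions = self.region_proj(region_feats)                       # (batch, n_regions, embed_dim)
        # Cosine similarity between the role query and every candidate region.
        query = query.unsqueeze(1).expand_as(regions)
        return F.cosine_similarity(regions, query, dim=-1)             # (batch, n_regions)

def ranking_loss(scores, gold_idx, margin=0.2):
    # Hinge loss: the (noisy) gold region should outscore every other region by a margin.
    gold = scores.gather(1, gold_idx.unsqueeze(1))                     # (batch, 1)
    mask = F.one_hot(gold_idx, scores.size(1)).bool()                  # zero out the gold slot
    return F.relu(margin + scores - gold).masked_fill(mask, 0.0).mean()

# Toy usage with random tensors standing in for detector features.
model = RoleGrounder(n_verbs=500, n_roles=20)
feats = torch.randn(4, 10, 2048)            # 4 images, 10 candidate regions each
scores = model(torch.tensor([3, 3, 7, 7]), torch.tensor([0, 1, 0, 2]), feats)
loss = ranking_loss(scores, gold_idx=torch.tensor([2, 5, 0, 9]))
loss.backward()
```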
Anthology ID: D18-1282
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month: October–November
Year: 2018
Address: Brussels, Belgium
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 2616–2626
URL: https://www.aclweb.org/anthology/D18-1282
DOI: 10.18653/v1/D18-1282
PDF: http://aclanthology.lst.uni-saarland.de/D18-1282.pdf