Referring to Objects in Videos Using Spatio-Temporal Identifying Descriptions

Peratham Wiriyathammabhum, Abhinav Shrivastava, Vlad Morariu, Larry Davis


Abstract
This paper presents a new task: grounding spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization, which enables us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns to ground spatio-temporal identifying descriptions based on appearance and motion. We show that the motion modules help to ground motion-related words and also improve learning in the appearance modules, because modular neural networks resolve task interference between modules. Finally, we identify a future challenge and the need for a robust system when ground-truth visual annotations are replaced with an automatic video object detector and temporal event localization.
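
To illustrate the two-stream modular idea described in the abstract, below is a minimal sketch (not the authors' implementation) of how an appearance module and a motion module could each score candidate regions against a description embedding, with a learned gate mixing the two streams. All module names, feature dimensions, and the gating scheme are assumptions for illustration only.

# Hypothetical sketch of a two-stream modular grounding scorer (illustrative, not the paper's code).
# Each candidate region/tube receives an appearance score and a motion score against the
# language embedding; a learned gate estimates how much the description is about motion
# versus appearance and mixes the two scores accordingly.
import torch
import torch.nn as nn


class TwoStreamModularScorer(nn.Module):
    def __init__(self, lang_dim=300, app_dim=2048, mot_dim=1024, hid_dim=512):
        super().__init__()
        # Appearance module: matches the description against static region features.
        self.app_lang = nn.Linear(lang_dim, hid_dim)
        self.app_vis = nn.Linear(app_dim, hid_dim)
        # Motion module: matches the description against motion (e.g. optical-flow) features.
        self.mot_lang = nn.Linear(lang_dim, hid_dim)
        self.mot_vis = nn.Linear(mot_dim, hid_dim)
        # Gate deciding how much weight to give the motion stream vs. the appearance stream.
        self.gate = nn.Sequential(nn.Linear(lang_dim, 2), nn.Softmax(dim=-1))

    def forward(self, lang, app_feats, mot_feats):
        # lang: (B, lang_dim) description embedding
        # app_feats: (B, N, app_dim) appearance features for N candidate regions
        # mot_feats: (B, N, mot_dim) motion features for the same candidates
        app_score = (self.app_vis(app_feats) * self.app_lang(lang).unsqueeze(1)).sum(-1)
        mot_score = (self.mot_vis(mot_feats) * self.mot_lang(lang).unsqueeze(1)).sum(-1)
        w = self.gate(lang)  # (B, 2) weights for [appearance, motion]
        return w[:, 0:1] * app_score + w[:, 1:2] * mot_score  # (B, N) candidate scores

At inference time the highest-scoring candidate would be taken as the referent; training could use a ranking or cross-entropy loss over candidates. The separate per-stream parameters are what would let motion-related words be handled without interfering with the appearance module, in the spirit of the abstract's claim about resolving task interference.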
Anthology ID: W19-1802
Volume: Proceedings of the Second Workshop on Shortcomings in Vision and Language
Month: June
Year: 2019
Address: Minneapolis, Minnesota
Venues: NAACL | WS
Publisher: Association for Computational Linguistics
Pages: 14–25
URL: https://www.aclweb.org/anthology/W19-1802
DOI: 10.18653/v1/W19-1802
PDF: http://aclanthology.lst.uni-saarland.de/W19-1802.pdf
Supplementary: W19-1802.Supplementary.pdf