Multimodal Machine Learning: Integrating Language, Vision and Speech

Louis-Philippe Morency, Tadas Baltrušaitis


Abstract
Multimodal machine learning is a vibrant multi-disciplinary research field which addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic and visual messages. With the initial research on audio-visual speech recognition and more recently with image and video captioning projects, this research field brings some unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities.This tutorial builds upon a recent course taught at Carnegie Mellon University during the Spring 2016 semester (CMU course 11-777) and two tutorials presented at CVPR 2016 and ICMI 2016. The present tutorial will review fundamental concepts of machine learning and deep neural networks before describing the five main challenges in multimodal machine learning: (1) multimodal representation learning, (2) translation & mapping, (3) modality alignment, (4) multimodal fusion and (5) co-learning. The tutorial will also present state-of-the-art algorithms that were recently proposed to solve multimodal applications such as image captioning, video descriptions and visual question-answer. We will also discuss the current and upcoming challenges.
Anthology ID:
P17-5002
Volume:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Month:
July
Year:
2017
Address:
Vancouver, Canada
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
3–5
Language:
URL:
https://www.aclweb.org/anthology/P17-5002
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
http://aclanthology.lst.uni-saarland.de/P17-5002.pdf