IIIT Hyderabad Publications
Title: Towards Improving Vision-Language Multimodal Systems
Author: Kshitij Gupta
Date: 2022-06-17
Report no: IIIT/TH/2022/73
Advisor: Radhika Mamidi

Abstract

We perceive the natural world through multiple input sources, with vision and language being two of the most important. The fields of computer vision and natural language processing have seen significant progress alongside advances in computing power and deep learning techniques. At the same time, the intersection of the two modalities is still in its nascent stages. In this thesis, we explore multimodality through several multimodal problems to better understand the shortcomings of the current state of research in the field. We also examine multimodal problems from a lingual perspective, as the majority of current multimodal efforts target English only. In brief, we extensively explore multimodal propaganda detection, multimodal machine translation, and visual question answering.

First, we explore a multimodal classification task in which we train models for propaganda detection in memes. We also study the importance of the visual modality in the task and propose a methodology that exploits robust textual transformers through an ensemble of text and vision-language transformers to improve performance on textually dominant tasks. Second, we explore the task of multimodal machine translation. We propose a methodology that exploits the pre-training of a textual machine translation system by bringing visual cues into the textual domain through object tags extracted from the image. Finally, we explore the task of visual question answering, for which we propose a methodology to train non-English multimodal transformers through knowledge distillation using a machine-translated dataset. We scoped this work to visual question answering, but it can also be extended to other vision-language tasks.
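The object-tag idea for multimodal machine translation can be sketched as follows: visual cues are brought into the textual domain by prepending detected object labels to the source sentence, so a pre-trained text-only translation system can consume them unchanged. This is a minimal illustration, not the thesis implementation; the detector output, the `##` separator, and the function names are assumptions for the sketch.

```python
# Sketch: turn visual context into text by prepending object tags,
# so a pre-trained text-only MT model can use the image information.
# detect_object_tags is a hypothetical stand-in for a real object
# detector; its labels below are hard-coded for illustration only.

def detect_object_tags(image_path):
    """Hypothetical detector returning object labels for an image."""
    return ["man", "horse", "field"]  # illustrative labels, not real output

def build_mmt_input(source_sentence, image_path, sep="##"):
    """Concatenate object tags with the source text for a text-only MT model."""
    tags = detect_object_tags(image_path)
    return f"{' '.join(tags)} {sep} {source_sentence}"

src = "A man is riding a horse."
print(build_mmt_input(src, "example.jpg"))
# -> man horse field ## A man is riding a horse.
```

The augmented string is then fed to the machine translation system exactly like ordinary source text, which is what lets the approach reuse textual pre-training.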
To the best of our knowledge, we propose state-of-the-art models for multimodal machine translation for the English-Hindi language pair. Further, we propose a novel approach to train visual-lingual models in different languages with the aid of machine translation and achieve state-of-the-art performance on the Japanese and Hindi visual question answering tasks.

Full thesis: pdf

Centre for Language Technologies Research Centre
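The cross-lingual distillation objective mentioned above can be sketched in a few lines: a student model trained on machine-translated questions is guided by the soft answer distribution of an English teacher on the original questions. This is only an illustration of the standard KL-divergence distillation loss; the logits, temperature value, and function names are assumptions, not the thesis's actual setup.

```python
import math

# Sketch of knowledge distillation for non-English VQA: the student
# (fed a machine-translated question) matches the teacher's softened
# answer distribution (from the original English question).
# All logits below are made-up numbers for illustration.

def softmax(logits, temperature=1.0):
    """Temperature-softened probability distribution over answers."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]   # teacher's answer logits (English question)
student = [2.5, 1.2, 0.3]   # student's logits (machine-translated question)
print(kd_loss(teacher, student))
```

Minimizing this loss pulls the student's answer distribution toward the teacher's, which is what allows supervision to transfer across languages even when the translated questions are noisy.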
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.