IIIT Hyderabad Publications
Towards Multimodal Reasoning and Inference using Large Language Models

Author: Suyash Vardhan Mathur (2019114006)
Date: 2024-07-01
Report no: IIIT/TH/2024/138
Advisor: Manish Shrivastava

Abstract

Great strides have been made in Natural Language Processing (NLP) and Computer Vision (CV) in recent years. Large Language Models (LLMs), especially those at the parameter scales of GPT-3.5 and GPT-4, have revolutionized tasks ranging from summarization to question answering, while Vision Transformers have enabled highly efficient image segmentation, object detection, and image synthesis models. However, much work remains in the multimodal space, which requires both NLP and CV to process inputs and outputs that combine text and images. In this dissertation, we work towards multimodal inference and reasoning with LLMs and pursue research questions related to multimodal question answering by such LLMs through three distinct problems: Multimodal Emotion-Cause Pair Extraction in Conversations, Question Answering using LLMs for Unconventional Reasoning, and using Multimodal Large Language Models (MLLMs) to perform knowledge-aware inference and reasoning over semi-structured multimodal tables.

We first explore multimodality through one of the most fundamental NLP tasks, emotion analysis, via Multimodal Emotion-Cause Pair Extraction in Conversations. We model the task both as an utterance-labelling and as a sequence-labelling problem and experiment with different encoders for the visual, audio, and textual modalities in the conversations. We conduct a comparative study of baselines that pair different encoders with an MLP, BiLSTMs, and a BiLSTM+CRF layer.

Going further, we explore unconventional reasoning by LLMs on questions involving lateral thinking, which requires looking at problems from an unconventional perspective and defying existing conceptions and notions. We experiment on the BrainTeaser dataset using few-shot prompts whose examples include explanations of the reasoning, helping the model better understand the unconventional reasoning required and improving over the zero-shot LLM baseline results.

Building upon multimodality and LLM-based reasoning, we propose the task of knowledge-aware question answering over semi-structured multimodal tables and experiment with SOTA LLMs and MLLMs for solving it. We create the MultimodalTabQA dataset, which consists of 35,111 questions over 16,941 tables recast from three existing tabular question-answering datasets. The dataset involves complex questions that require handling multiple images as input, performing knowledge-aware entity disambiguation, understanding the semi-structured information represented, and understanding the entities in the context of the table. We experiment with three different approaches to answering these questions and demonstrate the capabilities of SOTA LLMs on this new task.

Full thesis: pdf

Centre for Language Technologies Research Centre
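As a rough illustration of the few-shot prompting strategy mentioned above (in-context examples that carry an explanation of the lateral reasoning before the answer), the sketch below shows one possible way such prompts could be assembled. The build_prompt helper and the example puzzle are hypothetical and are not taken from the thesis or the BrainTeaser dataset; the actual experiments use an LLM API not shown here.

# Illustrative sketch only: a few-shot prompt whose examples include a
# reasoning explanation before the answer. Example content and helper
# names are hypothetical, not drawn from the BrainTeaser dataset.

FEW_SHOT_EXAMPLES = [
    {
        "question": "A man pushes his car to a hotel and loses his fortune. What happened?",
        "explanation": "Taken literally the scene makes little sense; read laterally, "
                       "it describes a board game where landing on a hotel costs rent.",
        "answer": "He is playing Monopoly.",
    },
]

def build_prompt(question: str) -> str:
    """Assemble a few-shot prompt with reasoning explanations in the examples."""
    parts = ["Solve the following lateral-thinking puzzles. Explain your reasoning, then answer."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Question: {ex['question']}")
        parts.append(f"Reasoning: {ex['explanation']}")
        parts.append(f"Answer: {ex['answer']}")
    # Append the new puzzle and leave the reasoning slot for the model to fill.
    parts.append(f"Question: {question}")
    parts.append("Reasoning:")
    return "\n".join(parts)

if __name__ == "__main__":
    print(build_prompt("What gets wetter the more it dries?"))

The resulting string would be passed to an LLM in place of a zero-shot question, which is the comparison the abstract reports improving over.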