IIIT Hyderabad Publications
Robust Visual Question-Answering using Generative Vision Language Models

Author: Rahul Mehta (2020900039)
Date: 2024-07-08
Report no: IIIT/TH/2024/160
Advisor: Vasudeva Varma

Abstract

Visual Question Answering (VQA) represents a long-standing challenge at the intersection of computer vision and natural language processing, where machines must answer questions about visual content such as images or videos. The challenge lies not only in recognizing objects, scenes, and relationships within the visual input, but also in comprehending the context of the questions posed in natural language. VQA systems are designed to understand the semantics of both the visual and textual modalities, requiring sophisticated algorithms to extract meaningful features from images or videos and integrate them with linguistic cues to generate accurate responses.

We release a VQA system for electrical circuit images that can serve as a quiz generator, a design-and-verification assistant, or an electrical diagnosis tool. Although there is a vast literature on VQA, to the best of our knowledge there is no existing work on VQA for electrical circuit images. To this end, we curate a new dataset, circuitVQA, of 115K+ questions on 5,725 electrical images covering ∼70 circuit symbols. The dataset contains both schematic and hand-drawn images, and the questions span categories such as counting, value, junction, and position. To be effective, models must demonstrate object detection, text recognition, spatial understanding, question-intent understanding, and answer generation. We experiment with multiple foundational visiolinguistic models for this task and find that a fine-tuned BLIP model with component descriptions as additional input provides the best results. Hallucination in vision language models, and in their language-model components, is also a challenging area of research that directly affects a model's performance.
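The abstract notes that the fine-tuned BLIP model performed best when component descriptions were supplied as additional input. A minimal sketch of one way such descriptions could be folded into the question text before it reaches the model's text encoder; the function name, prompt format, and example components are illustrative, not taken from the thesis:

```python
def build_vqa_prompt(question: str, components: dict[str, str]) -> str:
    """Prepend circuit-component descriptions to a VQA question.

    `components` maps a symbol label (e.g. "R1") to a short description.
    The combined string would be passed to the VQA model alongside the
    circuit image. This is a sketch of the idea, not the thesis pipeline.
    """
    context = "; ".join(f"{label}: {desc}" for label, desc in components.items())
    return f"Components: {context}. Question: {question}"

prompt = build_vqa_prompt(
    "How many resistors are in the circuit?",
    {"R1": "resistor, 10 ohm", "C1": "capacitor, 1 uF"},
)
```

The design choice here is simply to serialize structured detections into the text modality, so a text-and-image model can condition on them without architectural changes.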
We systematically study this phenomenon and work on quantifying hallucination in a visual question answering system. We also work on detecting hallucinations in large language models using an ensemble of classifier models. Finally, we attempt to mitigate the hallucination problem by using reinforcement-learning-based rewards to improve the text generation of these language models.

Full thesis: pdf

Centre for Language Technologies Research Centre
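Detecting hallucinations with an ensemble of classifiers can be reduced, at its simplest, to a majority vote over independent detectors. The stub classifiers below are stand-ins, since the abstract does not specify which models or features the thesis ensemble uses:

```python
from collections import Counter

def ensemble_detect(answer_features: dict, classifiers) -> bool:
    """Flag an answer as hallucinated if a majority of classifiers say so.

    Each classifier is any callable mapping features to True (hallucinated)
    or False. A simple sketch of ensemble voting, not the thesis's system.
    """
    votes = [clf(answer_features) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Illustrative stub classifiers standing in for trained models.
clfs = [
    lambda x: x["confidence"] < 0.5,   # low model confidence
    lambda x: x["entropy"] > 2.0,      # high token entropy
    lambda x: not x["grounded"],       # answer not grounded in the image
]
flag = ensemble_detect({"confidence": 0.3, "entropy": 2.5, "grounded": True}, clfs)
```

Majority voting is the usual motivation for ensembling here: individually weak detectors disagree on different examples, so aggregating their votes lowers the variance of the final decision.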
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.