IIIT Hyderabad Publications
Integrating Vision-Language Models for Enhanced Scene Understanding in Autonomous Driving

Author: Tushar Choudhary (2019111019)
Date: 2024-06-27
Report no: IIIT/TH/2024/121
Advisor: Madhava Krishna
Centre: Centre for Robotics
Full thesis: pdf

Abstract

Autonomous driving (AD) systems require a comprehensive understanding of their surroundings to navigate safely without human intervention. This involves interpreting intricate scenes, reasoning about object interactions, and anticipating their future implications, all of which are essential for making informed decisions. However, existing AD systems often depend on task-specific models trained on limited datasets, which restricts their adaptability to diverse real-world scenarios. Recent advances in large language models and large vision-language models (LLMs and LVLMs) offer a promising way to overcome these limitations by providing general-purpose scene understanding capabilities.

This work introduces Talk2BEV, a large vision-language model interface for bird's-eye-view (BEV) maps in autonomous driving. Whereas previous perception systems for autonomous driving have mainly focused on predefined (closed) sets of object categories and driving scenarios, Talk2BEV integrates recent advances in general-purpose language and vision models with BEV map representations. This integration eliminates the need for task-specific models, allowing a single system to handle a variety of autonomous driving tasks, including visual and spatial reasoning, predicting the intentions of traffic actors, and decision-making based on visual cues. Talk2BEV has been extensively evaluated on a wide range of scene understanding tasks that require interpreting free-form natural language queries and grounding them in the visual context embedded in the language-enhanced BEV map. Notably, the approach requires no additional training, offering flexibility and enabling rapid deployment across different domains and tasks.

This work further presents a lightweight Vision Language Network (VLN) that estimates a goal-point location from a given language command, which serves as an intermediate representation. A generalized open-set LLM or a human driver can understand an autonomous driving scenario and suggest an appropriate action, which the VLN then consumes to predict an optimized goal point that is subsequently used by downstream planners. This extension improves explainability and efficiency in autonomous driving tasks, since the action and goal point act as interpretable intermediate inputs and outputs.

Together, these contributions aim to advance the development of generalizable perception systems for autonomous vehicles by emphasizing the integration of language understanding with visual reasoning capabilities.
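To make the idea of querying a language-enhanced BEV map concrete, the following is a minimal, hypothetical sketch rather than the thesis implementation: it assumes the BEV map is stored as a list of object records with LVLM-generated captions, which are serialized into a prompt for a general-purpose LLM so free-form questions can be answered without task-specific training. The names `bev_map`, `build_prompt`, and `call_llm` are illustrative only.

import json

# Hypothetical language-enhanced BEV map: each object carries its BEV position
# (metres, ego-centred), footprint area, and an LVLM-generated caption.
bev_map = [
    {"id": 1, "category": "car", "bev_centroid": [12.4, -3.1], "area_m2": 8.2,
     "caption": "white sedan slowing down with brake lights on"},
    {"id": 2, "category": "pedestrian", "bev_centroid": [5.0, 2.6], "area_m2": 0.4,
     "caption": "person pushing a stroller towards the crosswalk"},
]

def build_prompt(bev_map, query):
    """Serialize the BEV object list and append the user's free-form query."""
    return (
        "You are assisting an autonomous vehicle. Objects in the bird's-eye-view "
        "map of the current scene (ego vehicle at the origin, x forward, y left):\n"
        + json.dumps(bev_map, indent=2)
        + f"\n\nQuestion: {query}\nAnswer concisely, referencing object ids."
    )

prompt = build_prompt(bev_map, "Which object should the ego vehicle yield to, and why?")
# The prompt would then be sent to any general-purpose chat LLM; `call_llm` is a
# placeholder for whichever completion API is available.
# answer = call_llm(prompt)
print(prompt)

In this sketch the LLM sees only the serialized object metadata, which mirrors the training-free, plug-and-play character described in the abstract: swapping in a different LLM or adding new object attributes requires no retraining of the perception stack.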