IIIT Hyderabad Publications
Vision and Language Navigation in Autonomous Driving

Author: Nivedita Rufus
Date: 2022-07-04
Report no: IIIT/TH/2022/105
Advisors: Madhava Krishna, Vineet Gandhi

Abstract

The launch of level 5 autonomous driving is imminent, which would result in the removal of controls such as the steering wheel, accelerator, and brakes. This warrants a better system for directing the self-driving agent to perform maneuvers if the passenger wishes to do so. Hence, this thesis proposes a set of solutions to tackle the problem of vision and language navigation in an autonomous driving setting, broken down into multiple sub-problems. For the first problem, we present a simple baseline for visual grounding in autonomous driving that outperforms state-of-the-art methods while retaining minimal design choices. The framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks to obtain the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%.

Humans have a natural ability to effortlessly comprehend linguistic commands such as "park next to the yellow sedan" and instinctively know which region of the road the vehicle should navigate to. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands, where knowing only the location of the object of interest in the scene is not enough. To this end, we propose a novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command.
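The grounding baseline described above can be sketched roughly as follows: each candidate ROI is scored against the command embedding by cosine similarity, and training minimizes cross-entropy over those scores. The dimensions, the temperature, and the ROI projection layer are illustrative assumptions, not details taken from the thesis (the abstract states only that a transformation layer is learned on top of the text embedding).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingBaseline(nn.Module):
    """Minimal sketch: cosine similarity between projected ROI features
    and a projected command embedding, trained with cross-entropy over
    the ROI scores. Dimensions and projections are assumptions."""

    def __init__(self, text_dim=768, roi_dim=2048, joint_dim=512):
        super().__init__()
        # Learned transformation on top of the pre-trained text embedding.
        self.text_proj = nn.Linear(text_dim, joint_dim)
        # Assumed projection so ROI features share the text embedding space.
        self.roi_proj = nn.Linear(roi_dim, joint_dim)
        self.scale = 10.0  # temperature on cosine scores (assumption)

    def forward(self, roi_feats, text_emb):
        # roi_feats: (B, N, roi_dim) pre-extracted ROI features
        # text_emb:  (B, text_dim)   pre-trained sentence embedding
        t = F.normalize(self.text_proj(text_emb), dim=-1)       # (B, D)
        r = F.normalize(self.roi_proj(roi_feats), dim=-1)       # (B, N, D)
        logits = self.scale * torch.einsum("bnd,bd->bn", r, t)  # cosine scores
        return logits

model = GroundingBaseline()
roi_feats = torch.randn(2, 8, 2048)   # 8 candidate ROIs per image
text_emb = torch.randn(2, 768)
logits = model(roi_feats, text_emb)
target = torch.tensor([3, 5])         # index of the ROI referred to by the command
loss = F.cross_entropy(logits, target)
```

At test time, the predicted region is simply the argmax over the ROI scores, which is what the AP50 metric then compares against the ground-truth box.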
RNR differs from Referring Image Segmentation (RIS), which grounds an object referred to by the natural language expression rather than a navigable region. For example, for the command "park next to the yellow sedan," RIS aims to segment the referred sedan, whereas RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.

Full thesis: pdf

Centre for Robotics
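Segmentation benchmarks such as the one described above are typically scored by intersection-over-union between the predicted and ground-truth masks; the sketch below shows that standard computation. The exact evaluation protocol used for Talk2Car-RegSeg is not specified in the abstract, so this is an assumed, generic metric.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary masks of shape (H, W).
    A standard segmentation metric; the benchmark's exact protocol is
    an assumption here."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

# Toy 4x4 masks: predicted region overlaps the ground-truth region.
pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 1:3] = 1
gt = np.zeros((4, 4), dtype=np.uint8); gt[1:3, 1:4] = 1
print(mask_iou(pred, gt))  # 4 / 6 ≈ 0.667
```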