IIIT Hyderabad Publications
Joint Semantic and Motion Segmentation for Enhanced Scene Understanding

Author: Nazrul Haque Athar
Date: 2017-07-27
Report no: IIIT/TH/2017/56
Advisor: Madhava Krishna

Abstract

This thesis presents end-to-end frameworks for joint semantic and motion segmentation of outdoor dynamic scenes, in both monocular and stereo setups. Dynamic scene understanding is a challenging problem, and motion segmentation plays a crucial role in solving it; incorporating semantics alongside motion enhances the overall perception of the dynamic scene. For outdoor robotic navigation, joint learning methods have not been extensively used to extract spatio-temporal features or to add priors into the formulation, and the task becomes even more challenging when stereo information is unavailable. The thesis proposes an approach that fuses semantic features and motion cues using CNNs to address the problem of monocular semantic motion segmentation. We deduce semantic and motion labels by integrating optical flow, as a constraint, with semantic features in a dilated convolution network. The pipeline consists of three main stages: feature extraction, feature amplification, and multi-scale context aggregation, which fuses the semantic and flow features. Our joint formulation shows significant improvements in monocular motion segmentation over state-of-the-art methods on the challenging KITTI Tracking dataset.

The second part of the thesis focuses on temporally consistent joint semantic and motion segmentation. Segmenting moving objects in a video sequence is a challenging problem and is critical to outdoor robotic navigation. While recent literature has focused on regularizing object labels over a sequence of frames, exploitation of spatio-temporal features for motion segmentation has been scarce. In real-world dynamic scenes in particular, existing approaches tend to fail at segmenting moving objects under large camera motion.
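The monocular pipeline rests on dilated convolutions, which aggregate context at multiple scales without shrinking resolution or adding parameters. A minimal NumPy sketch of the idea, where the fusion of semantic and flow maps by simple averaging and the helper name `dilated_conv2d` are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def dilated_conv2d(feat, kernel, dilation=1):
    """Single-channel 2D dilated convolution (valid padding).

    Hypothetical helper: inserting holes between kernel taps enlarges
    the receptive field without adding parameters.
    """
    kh, kw = kernel.shape
    # Effective kernel extent once holes are inserted between taps.
    eh = (kh - 1) * dilation + 1
    ew = (kw - 1) * dilation + 1
    H, W = feat.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = feat[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

# Stand-in fusion of a semantic map and a flow-magnitude map
# (averaging two single-channel maps, purely for illustration).
rng = np.random.default_rng(0)
sem = rng.random((16, 16))
flow_mag = rng.random((16, 16))
fused = 0.5 * (sem + flow_mag)

k = np.ones((3, 3)) / 9.0
ctx1 = dilated_conv2d(fused, k, dilation=1)  # 3x3 receptive field
ctx2 = dilated_conv2d(fused, k, dilation=2)  # effective 5x5 field
print(ctx1.shape, ctx2.shape)  # (14, 14) (12, 12)
```

Stacking such layers with exponentially growing dilation rates is what lets a multi-scale context aggregation stage reason over large image regions at full resolution.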
In this thesis, we present an approach that exploits semantic information and temporal constraints in a joint framework for motion segmentation in video. We propose a formulation that infers per-frame joint semantic and motion labels, using semantic potentials from a dilated CNN and motion potentials from depth and geometric constraints. We integrate the resulting potentials into a 3D (space-time) fully connected CRF over overlapping, connected blocks of frames. We then solve for a feature-space embedding in the spatio-temporal domain, enforcing temporal constraints from optical flow and long-term tracks as a least-squares problem. We evaluate our approach on the KITTI Tracking benchmark and demonstrate results superior to the state of the art in motion segmentation.

Full thesis: pdf
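The least-squares temporal constraint can be pictured as smoothing per-node motion scores along point tracks, pulling observations of the same physical point in consecutive frames toward a common value. A toy NumPy sketch, where the quadratic objective, the helper `temporally_smoothed_labels`, and the scalar scores are illustrative assumptions rather than the thesis formulation verbatim:

```python
import numpy as np

def temporally_smoothed_labels(unary, tracks, lam=1.0):
    """Least-squares smoothing of per-node scores along point tracks.

    unary  : (N,) per-node motion scores (e.g. from CRF potentials)
    tracks : list of (i, j) index pairs linked by optical flow
    Solves argmin_x ||x - unary||^2 + lam * sum_(i,j) (x_i - x_j)^2,
    a symmetric positive-definite linear system (illustrative
    stand-in for the thesis' embedding step).
    """
    n = unary.size
    A = np.eye(n)
    for i, j in tracks:
        A[i, i] += lam
        A[j, j] += lam
        A[i, j] -= lam
        A[j, i] -= lam
    return np.linalg.solve(A, unary)

scores = np.array([1.0, 0.0, 1.0, 0.0])
links = [(0, 1), (2, 3)]   # same point tracked across two frames
smooth = temporally_smoothed_labels(scores, links, lam=10.0)
print(np.round(smooth, 3))  # → [0.524 0.476 0.524 0.476]
```

With a large smoothing weight the linked scores converge toward each other, which is what makes the per-frame labels temporally consistent; in practice the system is sparse and would be solved with a sparse solver rather than a dense one.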
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.