Towards Semantic Scene Understanding of Cluttered Indoor Environments

Author: Sanchit Aggarwal
Date: 2016-07-27
Report no: IIIT/TH/2016/46
Advisor: Anoop M Namboodiri

Abstract

Parsing natural scenes into semantically meaningful entities is one of the open problems in computer vision. The problem is challenging because of the complexity present at various levels, ranging from scene-object and object-object to object-layout. Restricting the problem to indoor scenes makes it slightly more tractable. Holistic understanding of an indoor scene involves detecting objects, recovering their 3D geometry, estimating the spatial layout of the scene and classifying the scene. It helps in executing high-level tasks such as navigation, free space estimation, object placement and manipulation. In this thesis, we integrate information at these various levels in cluttered indoor environments for efficient semantic segmentation. We use appearance and geometric properties of different entities to estimate free space and localise objects in a given indoor scene. We believe that this work can enable a variety of applications that require semantic understanding of indoor scenes, for example mobility assessment, robot navigation, path planning and surveillance, object manipulation, grasping, learning object support order, visual search and 3D reconstruction.

We first attempt the problem of learning and estimating free space, i.e., floor regions, in indoor scenes from a single image. Estimating free space is challenging due to high appearance variability within the floor and non-floor regions. It is even harder to segment floor regions when clutter, specular reflections, shadows and textured floors are present in the scene. We propose a framework that uses a generic classifier of appearance cues as well as floor density estimates, both trained on a variety of indoor images. The result of the classifier is then adapted to a specific test image, where we integrate appearance, position and geometric cues in an iterative framework. A Markov Random Field framework is used to integrate the cues to segment floor regions. The proposed approach also remains applicable when the scene does not satisfy common assumptions, such as a Manhattan world or clutter restricted to wall-floor boundaries.

Moving on from detecting free space or floor regions, we use appearance and geometric properties to estimate more general entities in the cluttered scene. These entities or objects are the basic units of any indoor scene, free space being one of them. Examples include different configurations of sets of objects in scenarios ranging from indoor scenes with cluttered table tops to indoor scenes of offices, homes, corridors, classrooms, etc. Because of the high variability in indoor scenes and the intraclass variability within objects, we restrict our attention to table top scenarios with known objects. Estimating the layout of table top scenes is challenging due to the presence of clutter, objects whose appearance is similar to that of the table surface, object-object occlusions and objects with irregular shapes and sizes. We train an ensemble of classifiers over appearance cues from various images of the known objects in different poses. We learn the meta-data (pose, shape) associated with each object and try to estimate its pose and shape in a given cluttered scene. The approach predicts the detailed layout of the objects present and the free space on the table top where objects can be placed.

We created two datasets for this work. The first is an RGB-based "CVIT Indoor Scene dataset" of 110 images from various buildings on our campus that include cluttered floor regions. The images cover a wide variety of indoor scenes, including classrooms, living rooms, a library, corridors and scenes with two or three visible walls. The dataset also contains images with varied texture within the floor, specular highlights, shadows and floors cluttered by furniture or other obstacles, where the clutter is not confined to image boundaries. The second is an RGBD dataset, "3DMOS", of 50 objects with different shapes and sizes. Each object has 15 images in different poses and views. It also has 10 cluttered table top scenes, ranging from sparsely to densely cluttered. We have made the dataset publicly available to the research community.

The proposed approaches demonstrate robustness and efficiency in the complex situations mentioned above. We believe that this work can play a significant role in the true understanding of an indoor scene and its semantics. It could also help robotic agents or humans interact more with the surrounding environment when this information is given to them.

Keywords: Semantic Segmentation, Indoor Scene, Space Estimation, Object Manipulation, Cognitive Vision.

Full thesis: pdf

Centre for Visual Information Technology

IIIT Hyderabad Publications
