Leveraging Object Shape and Scene Geometry for Online Multi Object Tracking using Monocular Camera

Author: Sarthak Sharma
Date: 2019-06-20
Report no: IIIT/TH/2019/64
Advisor: Madhava Krishna

Abstract

In this thesis, we present a novel approach to the multi-object tracking (MOT) problem. Specifically, we focus on tracking multiple traffic participants in road scenes from a monocular camera mounted on an autonomous car. Although this problem is extremely well studied, state-of-the-art approaches rely exclusively on appearance information (derived from image pixels) to associate bounding-box detections over image sequences (i.e., tracking-by-detection). They fail to leverage the abundant world-space information that can be harnessed through simple projective geometry and other semantic cues. We show that moving beyond pixels and incorporating geometric and object-shape cues substantially improves the performance of the tracking-by-detection framework. We propose several 2D (appearance/pixel-space) and 3D (world-space) cues and construct pairwise costs that can readily be incorporated into any tracking-by-detection workflow. Moreover, all the proposed cues (both 2D and 3D) are computed from image sequences captured by a monocular camera. Although reasoning about the 3D shape and pose of an object in a monocular setting is ill-posed, we demonstrate that incorporating prior knowledge about the scene can help. This prior knowledge consists of a known camera height (which is the case in autonomous driving systems), the assumption that vehicles are co-planar with the ego vehicle, and knowledge of how 3D shapes project into the image. Together, these help in reasoning about the reverse process: recovering the 3D object shape, pose, and motion in a monocular setting. This knowledge about the scene and the objects helps in forming association costs in 3D. We train a convolutional neural network to reason about semantic keypoints of vehicles, which not only provide observations for reasoning about objects in 3D but also provide complementary 2D cues about object appearance for the task of data association.
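To illustrate how a known camera height and the co-planarity assumption can turn a pixel into a world-space point, the following is a minimal sketch and not the thesis implementation. It assumes a pinhole camera with zero pitch and roll, camera axes with x to the right, y downward, and z forward, and a pixel that lies on the ground plane (for example, the bottom-center of a vehicle's bounding box). The function name, the intrinsics, and the sample values are hypothetical.

```python
def backproject_to_ground(u, v, fx, fy, cx, cy, cam_height):
    """Intersect the viewing ray of pixel (u, v) with the ground plane.

    Assumptions (illustrative, not taken from the thesis):
    pinhole model, optical axis parallel to the ground, and the
    ground plane located at y = cam_height in camera coordinates.
    Returns the 3D point (x, y, z) in metres, in the camera frame.
    """
    # Direction of the viewing ray for unit depth: ((u-cx)/fx, (v-cy)/fy, 1).
    ray_y = (v - cy) / fy
    if ray_y <= 0:
        # Rays at or above the horizon never reach the ground plane.
        raise ValueError("pixel is at or above the horizon")
    # Scale the ray so that its y component equals the camera height.
    depth = cam_height / ray_y
    x = (u - cx) / fx * depth
    return (x, cam_height, depth)


# Hypothetical intrinsics and camera height, for illustration only.
point = backproject_to_ground(u=740.0, v=250.0, fx=700.0, fy=700.0,
                              cx=600.0, cy=180.0, cam_height=1.65)
```

Two detections back-projected this way can then be compared by their distance on the ground plane, which is one simple way a world-space association cost could be formed.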
The proposed costs are agnostic to the data association method and can be incorporated into any optimization framework. These costs are easy to implement, can be computed in real time, and complement each other to account for possible errors in a tracking-by-detection framework. We perform an extensive analysis of the designed costs and empirically demonstrate consistent improvement over the state of the art under varying conditions, covering a range of object detectors and a variety of camera and object motions. We show that, even with the simplest of association frameworks (two-frame Hungarian assignment), the presented approach surpasses the state of the art in multi-object tracking on road scenes.
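The two-frame association step can be sketched as follows. This is a hedged illustration, not the thesis code: the weighted sum used to combine costs, the gating threshold, and all function names are assumptions. To stay dependency-free, the sketch finds the minimum-cost assignment by exhaustive search, which yields the same optimum as the Hungarian algorithm but is only practical for a handful of objects; in practice a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be used instead.

```python
from itertools import permutations


def combine_costs(cost_matrices, weights):
    """Weighted sum of several (tracks x detections) pairwise cost matrices."""
    n_rows, n_cols = len(cost_matrices[0]), len(cost_matrices[0][0])
    return [[sum(w * m[i][j] for m, w in zip(cost_matrices, weights))
             for j in range(n_cols)]
            for i in range(n_rows)]


def associate(cost, max_cost=float("inf")):
    """Return (track, detection) pairs minimising the total cost.

    Exhaustive search stands in for the Hungarian algorithm here.
    Pairs whose cost exceeds max_cost are dropped (left unmatched).
    Assumes a non-empty cost matrix.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    transposed = n_rows > n_cols
    if transposed:
        # Keep the smaller side as rows so every row gets a distinct column.
        cost = [list(col) for col in zip(*cost)]
        n_rows, n_cols = n_cols, n_rows
    best = min(permutations(range(n_cols), n_rows),
               key=lambda p: sum(cost[i][j] for i, j in enumerate(p)))
    pairs = [(i, j) for i, j in enumerate(best) if cost[i][j] <= max_cost]
    if transposed:
        pairs = sorted((j, i) for i, j in pairs)
    return pairs


# Hypothetical costs for 2 existing tracks and 3 new detections.
appearance = [[0.1, 0.9, 0.8], [0.7, 0.2, 0.9]]   # 2D cue
geometry = [[0.2, 0.8, 0.9], [0.9, 0.1, 0.3]]     # 3D cue
total = combine_costs([appearance, geometry], weights=(0.5, 0.5))
matches = associate(total, max_cost=0.5)
```

Because the solver only sees a single cost matrix, any additional 2D or 3D cue can be folded in by appending another matrix and weight, which reflects the claim that the costs are independent of the association method.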

Full thesis: pdf

Centre for Robotics

IIIT Hyderabad Publications
