IIIT Hyderabad Publications
First Person Action Recognition

Author: Suriya Singh
Date: 2016-11-01
Report no: IIIT/TH/2016/76
Advisors: Chetan Arora, C V Jawahar

Abstract

Egocentric cameras are wearable cameras mounted on a person's head or shoulder. With their ability to capture what the wearer is seeing, such cameras are spawning a new set of exciting applications in computer vision. Recognising the activity of the wearer from an egocentric video is an important but challenging problem. It is harder than third person activity recognition because the wearer's pose is unavailable. Unstructured camera movement due to the wearer's natural head motion causes sharp changes in the visual field, making the problem even more challenging and causing many standard third person action recognition techniques to perform poorly on such videos. On the other hand, the objects present in the scene and the hand gestures of the wearer are the most important cues for first person action recognition; however, such cues are difficult to segment and recognise in an egocentric video. Carefully crafted features based on hand and object cues have been shown to be successful on limited, targeted datasets.

In the first part of the thesis, we propose a novel representation of first person actions derived from feature trajectories. The features are simple to compute using standard feature tracking and, unlike many previous approaches, do not require segmenting hands or objects or recognising object or hand pose. We train a bag of words classifier with the proposed features and report a significant performance improvement on publicly available datasets.

In the second part of the thesis, we propose convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions. The proposed network exploits egocentric cues by capturing hand pose, head motion, and saliency maps. The model is compact and can therefore be trained from the relatively small number of labelled egocentric videos that are available. We show that the network generalises, giving state-of-the-art performance on several egocentric action datasets that differ widely from each other both visually and dynamically.
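As a rough illustration of the first part, the sketch below shows a trajectory-based bag-of-words pipeline of the kind described above. It is a minimal sketch, not the thesis code: the KLT tracker, the 15-frame trajectory length, the codebook size, and the linear SVM are all assumptions made for illustration.

```python
# Illustrative sketch only: tracker, trajectory length, codebook size,
# and classifier are assumptions, not the thesis implementation.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def trajectory_descriptors(path, length=15):
    """Track corners with KLT and return one flattened sequence of
    frame-to-frame displacements per completed trajectory."""
    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    tracks, descs = [], []
    while True:
        if len(tracks) < 50:  # (re)detect corners when tracks run low
            pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                          qualityLevel=0.01, minDistance=7)
            if pts is not None:
                tracks += [[p.ravel()] for p in pts]
        ok, frame = cap.read()
        if not ok or not tracks:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        tails = np.float32([t[-1] for t in tracks]).reshape(-1, 1, 2)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, tails, None)
        alive = []
        for t, p, s in zip(tracks, nxt.reshape(-1, 2), status.ravel()):
            if not s:                 # track lost
                continue
            t.append(p)
            if len(t) == length:      # trajectory complete: keep its motion
                descs.append(np.diff(np.asarray(t), axis=0).ravel())
            else:
                alive.append(t)
        tracks, prev = alive, gray
    return np.asarray(descs).reshape(-1, 2 * (length - 1))

def train_bow_classifier(videos, labels, k=100):
    """Quantise all training descriptors into k visual words, then
    train a linear SVM on the per-video word histograms."""
    per_video = [trajectory_descriptors(v) for v in videos]
    codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(per_video))
    hists = [np.bincount(codebook.predict(d), minlength=k) for d in per_video]
    return codebook, LinearSVC().fit(np.asarray(hists, dtype=float), labels)
```

A test video is scored the same way: quantise its trajectory descriptors against the learnt codebook and classify the resulting word histogram.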
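Similarly, for the second part, here is a minimal sketch in PyTorch of the kind of compact multi-cue network the abstract describes. The exact architecture, the layer sizes, and the way the cues are stacked (hand mask, saliency map, and two optical-flow channels concatenated as a four-channel input) are illustrative assumptions, not the thesis model.

```python
# Illustrative sketch only: architecture and cue stacking are assumptions.
import torch
import torch.nn as nn

class EgoCueNet(nn.Module):
    """Compact CNN over stacked egocentric cues:
    hand mask (1 ch) + saliency map (1 ch) + optical flow (2 ch)."""
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_actions)

    def forward(self, x):  # x: (batch, 4, H, W)
        return self.classifier(self.features(x).flatten(1))

# e.g. logits for 2 clips of 18 action classes at 128x128 resolution:
# EgoCueNet(18)(torch.randn(2, 4, 128, 128))  ->  shape (2, 18)
```

Global average pooling in place of large fully connected layers keeps the parameter count small, which is what makes training from a limited number of labelled egocentric videos feasible.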
Full thesis: pdf

Centre for Visual Information Technology