IIIT Hyderabad Publications
Understanding Text in Scene Images
Author: Anand Mishra
Report no: IIIT/TH/2016/72
Advisor:C V Jawahar, Karteek Alahari
With the rapid growth of camera-based mobile devices, applications that answer questions such as, “What does this sign say?" are becoming increasingly popular. This is related to the problem of optical character recognition (OCR) where the task is to recognize text occurring in images. The OCR problem has a long history in the computer vision community. However, the success of OCR systems is largely restricted to text from scanned documents. Scene text, such as text occurring in images captured with a mobile device, exhibits a large variability in appearance. Recognizing scene text has been challenging, even for the state-of-the-art OCR methods. Many scene understanding methods recognize objects and regions like roads, trees, sky in the image successfully, but tend to ignore the text on the sign board. Towards filling this gap, we devise robust techniques for scene text recognition and retrieval in this thesis. This thesis presents three approaches to address scene text recognition problems. First, we propose a robust text segmentation (binarization) technique, and use it to improve the recognition performance. We pose the binarization problem as a pixel labeling problem and define a corresponding novel energy function which is minimized to obtain a binary segmentation image. This method makes it possible to use standard OCR systems for recognizing scene text. Second, we present an energy minimization framework that exploits both bottom-up and top-down cues for recognizing words extracted from street images. The bottom-up cues are derived from detections of individual text characters in an image. We build a conditional random field model on these detections to jointly model the strength of the detections and the interactions between them. These interactions are top-down cues obtained from a lexicon-based prior, i.e., language statistics. The optimal word represented by the text image is obtained by minimizing the energy function corresponding to the random field model. The proposed method significantly improves the scene text recognition performance. Thirdly, we present a holistic word recognition framework, which leverages scene text image and synthetic images generated from lexicon words. We then recognize the text in an image by matching the scene and synthetic image featureswith our novel weighted dynamic time warping approach. This approach does not require any language statistics or language specific character-level annotations. Finally, we address the problem of image retrieval using textual cues, and demonstrate large-scale text-to-image retrieval. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that this approach, despite being based on state-of-the art methods, is insufficient, and propose an approach without relaying on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. We evaluate our proposed methods extensively on a number of scene text benchmark datasets, namely, street view text, ICDAR 2003, 2011 and 2013, and a new dataset IIIT 5K-word, we introduced, and show better performance than all the comparable methods. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely, IIIT scene text retrieval, Sports-10K and TV series-1M, we introduced.
Full thesis: pdf
Centre for Visual Information Technology
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.