Character N-Gram Spotting on Handwritten Documents using Weakly-Supervised Segmentation

Authors: Udit Roy, Naveen Sankaran T, Pramod Sankar, C V Jawahar
Conference: International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA.

Date: 2013-08-25
Report no: IIIT/TR/2013/78

Abstract

In this paper, we present a solution towards building a retrieval system over handwritten document images that i) is recognition-free, ii) allows text querying, iii) can retrieve at the sub-word level, and iv) can search for out-of-vocabulary words. Unlike previous approaches that operate at either the character or the word level, we use character n-gram images (CNG-img) as the retrieval primitive. CNG-img are sequences of character segments that are represented and matched in the image space. Word images are then treated as a bag-of-CNG-img that can be indexed and matched in the feature space. This allows for recognition-free search (query-by-example), which can retrieve morphologically similar words that have matching sub-words. Further, to enable query-by-keyword, we build an automated scheme to generate labeled exemplars for characters and character n-grams from unconstrained handwritten documents. We pose this problem as one of weakly-supervised learning, where character/n-gram labeling is obtained automatically from the word labels. The resulting retrieval system can answer queries from an unlimited vocabulary. The approach is demonstrated on the George Washington collection, where results show a major improvement in retrieval performance compared to word-recognition and word-spotting methods.
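The bag-of-n-grams idea in the abstract can be illustrated with a minimal text-domain sketch. This is only an analogy under stated assumptions: the paper matches n-grams of character image segments in a feature space, whereas the code below matches n-grams of plain characters in transcribed strings. The function names (`char_ngrams`, `overlap_score`, `rank`) and the choice of n = 1, 2, 3 are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(word, n_values=(1, 2, 3)):
    """Return the multiset of character n-grams of `word` (assumed n in n_values)."""
    grams = Counter()
    for n in n_values:
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


def overlap_score(query, candidate, n_values=(1, 2, 3)):
    """Fraction of the query's n-grams that also occur in the candidate."""
    q = char_ngrams(query, n_values)
    c = char_ngrams(candidate, n_values)
    shared = sum((q & c).values())  # multiset intersection
    total = sum(q.values())
    return shared / total if total else 0.0


def rank(query, vocabulary):
    """Order candidate words by decreasing n-gram overlap with the query."""
    return sorted(vocabulary, key=lambda w: overlap_score(query, w), reverse=True)


if __name__ == "__main__":
    # A word sharing sub-words with the query ranks above unrelated words,
    # even if the query itself never appears in the vocabulary.
    print(rank("order", ["letter", "orders", "camp"]))
```

Because scoring depends only on shared sub-word units, a query that is absent from the vocabulary can still retrieve morphologically related words, which is the property the abstract attributes to sub-word retrieval.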

Full paper: pdf

Centre for Visual Information Technology

IIIT Hyderabad Publications
