IIIT Hyderabad Publications
Character N-Gram Spotting on Handwritten Documents using Weakly-Supervised Segmentation

Authors: Udit Roy, Naveen Sankaran T, Pramod Sankar, C V Jawahar
Conference: International Conference on Document Analysis and Recognition, 25-28 Aug. 2013, Washington DC, USA.
Date: 2013-08-25
Report no: IIIT/TR/2013/78
Centre for Visual Information Technology

Abstract

In this paper, we present a solution towards building a retrieval system over handwritten document images that i) is recognition-free, ii) allows text-querying, iii) can retrieve at sub-word level, and iv) can search for out-of-vocabulary words. Unlike previous approaches that operate at either character or word levels, we use character n-gram images (CNG-img) as the retrieval primitive. CNG-img are sequences of character segments that are represented and matched in the image space. The word-images are then treated as a bag-of-CNG-img that can be indexed and matched in the feature space. This allows for recognition-free search (query-by-example), which can retrieve morphologically similar words that have matching sub-words. Further, to enable query-by-keyword, we build an automated scheme to generate labeled exemplars for characters and character n-grams from unconstrained handwritten documents. We pose this problem as one of weakly-supervised learning, where character/n-gram labeling is obtained automatically from the word labels. The resulting retrieval system can answer queries from an unlimited vocabulary. The approach is demonstrated on the George Washington collection; the results show a major improvement in retrieval performance compared to word-recognition and word-spotting methods.
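To make the bag-of-n-grams idea in the abstract concrete, the following is a minimal sketch that works on text labels only, as a stand-in for the image-space matching the paper describes. It is not the authors' method: the function names (`char_ngrams`, `bag_similarity`, `rank_words`), the choice of n-gram lengths 1 to 3, and the multiset Jaccard score are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(word, n_values=(1, 2, 3)):
    """Return the bag (multiset) of character n-grams of a word label.

    Hypothetical helper: the n-gram lengths are an assumption,
    not a setting taken from the paper.
    """
    word = word.lower()
    bag = Counter()
    for n in n_values:
        for i in range(len(word) - n + 1):
            bag[word[i:i + n]] += 1
    return bag


def bag_similarity(bag_a, bag_b):
    """Multiset Jaccard overlap between two n-gram bags (assumed score)."""
    intersection = sum((bag_a & bag_b).values())
    union = sum((bag_a | bag_b).values())
    return intersection / union if union else 0.0


def rank_words(query, vocabulary):
    """Rank vocabulary words by n-gram overlap with the query, best first."""
    query_bag = char_ngrams(query)
    scored = [(bag_similarity(query_bag, char_ngrams(w)), w) for w in vocabulary]
    # Sort by descending score, breaking ties alphabetically.
    return [w for _, w in sorted(scored, key=lambda t: (-t[0], t[1]))]


if __name__ == "__main__":
    # Morphologically similar words share most of their sub-word n-grams.
    print(rank_words("order", ["letter", "orders", "virginia"]))
```

Because the score depends only on shared sub-word units, a query never seen as a whole word can still rank words that contain its n-grams, which is the intuition behind sub-word and out-of-vocabulary retrieval. In the paper's setting the bags would hold image-feature representations of character segments rather than strings.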