IIIT Hyderabad Publications

Error Detection in Indic OCRs

Authors: Vinitha VS, C V Jawahar
Conference: 12th IAPR International Workshop on Document Analysis Systems (DAS 2016)
Location: Santorini, Greece
Date: 2016-04-11
Report no: IIIT/TR/2016/74

Abstract

A good post-processing module is an indispensable part of an OCR pipeline. In this paper, we propose a novel method for error detection in Indian language OCR output. Our solution uses a recurrent neural network (RNN) to classify a word as an error or not. We propose a generic error detection method and demonstrate its effectiveness on four popular Indian languages. We divide the words into their constituent aksharas and use their bigram- and trigram-level information to build a feature representation. To train the classifier on incorrect words, we use the misrecognized words in the output of the OCR. In addition to the RNN, we also explore the effectiveness of a generative model, a Gaussian mixture model (GMM), for our task and demonstrate improved performance by combining both approaches. We tested our method on four popular Indian languages and report an average error detection performance above 80%.

Centre for Visual Information Technology
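
The abstract mentions splitting words into aksharas and using bigram and trigram information over them. The following is a minimal sketch of that preprocessing step only, not the authors' implementation. The segmentation rule (combining marks attach to the preceding unit, and a virama also pulls in the following character), the function names `split_aksharas` and `akshara_ngrams`, and the `<s>` / `</s>` boundary padding are all assumptions made for illustration. The rule ignores ZWJ/ZWNJ and will over-merge in scripts such as Tamil, where the virama does not usually form conjuncts. How the paper turns these n-grams into a feature vector is not stated in the abstract and is not shown here.

```python
import unicodedata

# Virama code points for the major Indic blocks (Devanagari through Malayalam).
# Each block is 0x80 wide and places its virama at offset 0x4D.
VIRAMAS = {chr(base + 0x4D) for base in range(0x0900, 0x0D80, 0x80)}


def split_aksharas(word):
    """Approximate akshara segmentation (illustrative rule, an assumption).

    A character joins the current unit if it is a combining mark
    (Unicode category Mn or Mc) or if the unit so far ends in a virama.
    """
    aksharas = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if aksharas and (is_mark or aksharas[-1][-1] in VIRAMAS):
            aksharas[-1] += ch
        else:
            aksharas.append(ch)
    return aksharas


def akshara_ngrams(word, n):
    """Return akshara n-grams of a word, padded with hypothetical boundary tokens."""
    units = ["<s>"] + split_aksharas(word) + ["</s>"]
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


if __name__ == "__main__":
    word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # Devanagari word "Hindi"
    print(split_aksharas(word))
    print(akshara_ngrams(word, 2))
    print(akshara_ngrams(word, 3))
```

For the sample word, the sketch yields two aksharas, and therefore three padded bigrams and two padded trigrams. A classifier such as the RNN or GMM named in the abstract would consume some encoding of these n-grams.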