An Empirical Study of Effectiveness of Post-processing in Indic Scripts

Authors: Vinitha VS, Minesh Mathew, C V Jawahar
Conference: 14th IAPR International Conference on Document Analysis and Recognition (ICDAR-2017)
Location: Kyoto, Japan
Date: 2017-11-09
Report no: IIIT/TR/2017/53

Abstract

This paper explores the effectiveness of statistical language model (SLM) and dictionary-based methods for detecting and correcting errors in Indic OCR output. For the SLM, we use Unicode-level n-grams to build the language model. We compare its performance with akshara-level n-grams and find that akshara-level n-grams detect errors better than Unicode-level n-grams. We experimentally analyze the performance of dictionary-based post-processing for Indic OCR, compare it with English, and analyze the reasons for the under-performance in Indic scripts. We use four major Indian languages for our experiments: Hindi, Gurumukhi, Telugu and Malayalam.
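
As a rough illustration of the n-gram based error detection summarized in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a code-point level trigram model with add-one smoothing, scores each word by its average log-probability, and flags low-scoring words as suspect. The names `char_ngrams`, `NgramModel`, `avg_log_prob` and `is_suspect`, the boundary markers `^` and `$`, and the threshold are all hypothetical. An akshara-level variant would need a separate akshara segmenter, which is not sketched here.

```python
import math
from collections import Counter


def char_ngrams(word, n):
    """Return the code-point n-grams of a word, padded with boundary markers.

    '^' and '$' are illustrative start/end markers, assumed not to occur
    in the text itself.
    """
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Code-point level n-gram model with add-one (Laplace) smoothing."""

    def __init__(self, words, n=3):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.symbols = {"$"}
        for word in words:
            # Iterating a Python str yields Unicode code points.
            self.symbols.update(word)
            for gram in char_ngrams(word, n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def avg_log_prob(self, word):
        """Average smoothed log-probability of the word's n-grams."""
        grams = char_ngrams(word, self.n)
        vocab_size = len(self.symbols) + 1  # one extra slot for unseen symbols
        total = 0.0
        for gram in grams:
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total / len(grams)

    def is_suspect(self, word, threshold):
        """Flag a word as a likely OCR error if it scores below the threshold."""
        return self.avg_log_prob(word) < threshold


# Toy training vocabulary (illustrative only, not data from the paper).
model = NgramModel(["भारत", "भाषा", "भाग", "भार"], n=3)
seen_score = model.avg_log_prob("भारत")
garbled_score = model.avg_log_prob("तरभा")
```

In practice the threshold would have to be tuned on held-out data, and the model trained on a large corpus rather than a handful of words. In this toy setup the word seen in training scores higher than the scrambled one.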


Centre for Visual Information Technology

IIIT Hyderabad Publications
