IIIT Hyderabad Publications
Align Me: A framework to generate Parallel Corpus Using OCRs & Bilingual Dictionaries

Authors: Priyam Bakliwal, devadathv.v@research.iiit.ac.in, C V Jawahar
Conference: 26th International Conference on Computational Linguistics (COLING 2016)
Location: Osaka, Japan
Date: 2016-12-13
Report no: IIIT/TR/2016/64

Abstract

Multilingual processing tasks like statistical machine translation and cross-language information retrieval rely mainly on the availability of accurate parallel corpora. Manual construction of such corpora can be extremely expensive and time-consuming. In this paper we present a simple yet efficient method to generate a large amount of reasonably accurate parallel corpus with minimal user effort. We utilize the availability of a large number of English books and their corresponding translations in other languages to build parallel corpora. Optical Character Recognition systems are used to digitize such books. We propose a robust dictionary-based parallel corpus generation system for alignment of multilingual text at different levels of granularity (sentences, paragraphs, etc.). We show the performance of our proposed method on a manually aligned dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.

Full paper: pdf

Centre for Visual Information Technology
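
The dictionary-based alignment idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system; it is a toy illustration under stated assumptions: sentences are already segmented, the bilingual dictionary is a plain mapping from a source word to a set of target words, and each source sentence is paired with the target sentence that shares the most dictionary translations. The names `overlap_score`, `align_sentences`, `threshold`, and the romanized Hindi toy lexicon are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def overlap_score(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source words with at least one dictionary translation in tgt."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / len(src_words)


def align_sentences(
    src_sents: List[str],
    tgt_sents: List[str],
    lexicon: Dict[str, Set[str]],
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Pair each source sentence with its best-scoring target sentence.

    Returns (source index, target index, score) triples; source sentences
    whose best score falls below `threshold` are left unaligned.
    """
    pairs = []
    for i, s in enumerate(src_sents):
        scored = [(overlap_score(s, t, lexicon), j) for j, t in enumerate(tgt_sents)]
        if not scored:
            continue
        best_score, best_j = max(scored, key=lambda x: x[0])
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


# Toy data (assumed for illustration): English to romanized Hindi.
toy_lexicon = {
    "book": {"kitaab"},
    "red": {"laal"},
    "water": {"paani"},
    "cold": {"thanda"},
}
english = ["the book is red", "the water is cold"]
hindi = ["paani thanda hai", "kitaab laal hai"]

for i, j, score in align_sentences(english, hindi, toy_lexicon):
    print(f"{english[i]!r} <-> {hindi[j]!r} (score={score:.2f})")
```

A real pipeline would also have to cope with OCR errors, morphological variation, and one-to-many sentence mappings, none of which this sketch attempts.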