Parallel Corpora and Linguistic Resource Creation for Statistical Machine Translation

Author: Jayendra Rakesh Yeka
Date: 2017-12-23
Report no: IIIT/TH/2017/92
Advisor: Dipti Misra Sharma, Radhika Mamidi

Abstract

Statistical Machine Translation (SMT) is an automated approach to natural-language translation that uses statistical models generated from bilingual parallel corpora. The system is thus fundamentally reliant on bilingual parallel corpora to translate text from one language to another. Additionally, in the past several decades, research in SMT has focused on augmenting the statistical models with syntactic and linguistic information to generate better translations. Thus, the construction of linguistic resources such as bilingual parallel corpora and corpora enriched with syntactic and linguistic information plays a key role in building better SMT systems.

SMT work for the English-Hindi language pair is limited and has been resource-deprived, especially for Hindi. The prime motivation for the research in this thesis is to bridge the gaps in the resources available for the Hindi language and the English-Hindi language pair, from an SMT perspective. Initial experiments at building a Hindi-English SMT system with Part-Of-Speech (POS) augmented models yielded unfavourable results, with translations no better than those of the plain statistical models. Root cause analysis revealed fundamental flaws, such as the lack of better bilingual corpora and better syntactic models, which hinder improvements to the SMT system; addressing these issues thus became the goal of this thesis.

Firstly, we present several parallel corpora for English↔Hindi and discuss their nature and domains. We also briefly discuss a few previous attempts at MT from English to Hindi. The lack of uniformly annotated data makes it difficult to compare these attempts and precisely analyse their strengths and shortcomings. With this in mind, a standard pipeline that provides uniform linguistic annotations for these resources using state-of-the-art NLP technologies is presented. Benchmark scores for various English→Hindi SMT systems are established using the aforementioned parallel corpora. A total of 1,37,578 sentences were cleaned and MT systems were benchmarked as part of this work.

Subsequently, this thesis discusses several digital text sources from different domains in English and Hindi, which were processed with the intention of extracting parallel sentences. We faced several difficulties while extracting text content from different documents, especially the non-linear presentation of text in varying page layouts. We address this problem of linearizing the text from these digital sources by proposing a clustering-based algorithm to eliminate the noise. This effort helped us process raw sources available in different formats, fonts and page layouts to produce 79,422 English and 68,584 Hindi sentences with comparable meanings across both languages.

Lastly, a semi-automated framework is presented to speed up the dependency annotation task for corpora. We describe the slow game of pass-the-parcel between the automation and human sides, which finally results in raw corpora being transformed into dependency-annotated sentences. The methods of automation chosen to overcome the laborious and time-consuming process of corpus annotation are discussed. The errors and multiple analyses that arise during annotation, and ways to recover from them, are also discussed in detail. A total of 20,968 dependency-annotated Hindi sentences and 7,120 Urdu sentences were created using this framework.

Overall, the thesis presents three different efforts undertaken to quench the thirst for resources in the field of English↔Hindi SMT.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
