Hindi to English Machine Translation

Author: Kunal Sachdeva
Date: 2016-02-05
Report no: IIIT/TH/2016/5
Advisor: Dipti Misra Sharma

Abstract

Machine Translation (MT) is a task in Natural Language Processing (NLP) in which automatic systems translate text from one language to another while preserving the meaning of the source text. In this work, we first describe our efforts in developing a rule-based translation system built on the Analyze-Transfer-Generate paradigm, which employs morphological and syntactic analysis of the source language. We used a shallow parser for Hindi, along with dependency parse labels, for syntactic analysis, and developed modules for Hindi-to-English transfer and for English generation. Because the word order of the two languages differs widely (Hindi follows SOV order and English follows SVO order), a large number of reordering rules must be crafted to capture the irregularities of the language pair. Owing to this drawback of the rule-based approach, we shifted to statistical methods. A wide variety of machine translation approaches have been developed in past years. As each model has its pros and cons, we propose an approach that tries to capture the advantages of each system, thereby yielding a better MT system. We then incorporate semantic information into phrase-based machine translation using a monolingual corpus, from which the system learns semantically meaningful representations. Recent studies in machine translation indicate that multi-model systems perform better than individual models. In this thesis, we describe a Hindi to English statistical machine translation system and improve over the baselines using multiple translation models. We work with MOSES, a free statistical machine translation framework that allows a translation model to be trained automatically from a parallel corpus and supports multiple algorithms for training a model. We propose an approach for computing a quality score for each translation, using an automatic evaluation metric as the quality score.
The computed quality score guides selection among the translations produced by multiple models, thereby providing a better system. We use support vector regression to train a model on syntactic, textual and linguistic features extracted from the source sentence and its translation, with the evaluation metric as the regression output. Quality Estimation of Machine Translation is a task in which the system tries to predict the quality of the output on the basis of features extracted from the source and target languages. The system computes a quality score for each translation dynamically (at run time), and this score measures the correctness of the output. Unlike MT evaluation, quality estimation systems do not rely on reference translations and are generally built using machine learning techniques. The approach offers a great advantage to readers of the target language as well as to professional translators.
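
The guided selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the two features, the fixed weights in `placeholder_quality`, and the model labels used in the usage note are all hypothetical. In the thesis the scorer is a support vector regression model trained against an automatic evaluation metric; here a hand-written linear function stands in for it so that the example stays self-contained.

```python
from typing import Callable, Dict, List, Tuple


def extract_features(source: str, translation: str) -> List[float]:
    """Toy, hypothetical features computed without any reference translation."""
    src_tokens = source.split()
    tgt_tokens = translation.split()
    if not src_tokens or not tgt_tokens:
        # Empty input: worst length ratio, maximal repetition penalty.
        return [0.0, 1.0]
    # Ratio of the shorter to the longer token count, in (0, 1].
    length_ratio = min(len(src_tokens), len(tgt_tokens)) / max(len(src_tokens), len(tgt_tokens))
    # Share of target tokens that repeat an earlier token.
    repeated_share = 1.0 - len(set(tgt_tokens)) / len(tgt_tokens)
    return [length_ratio, repeated_share]


def placeholder_quality(features: List[float]) -> float:
    """Stand-in for a trained regressor: fixed weights, NOT learned from data."""
    weights = [1.0, -1.0]
    return sum(w * f for w, f in zip(weights, features))


def select_translation(
    source: str,
    candidates: Dict[str, str],
    predict: Callable[[List[float]], float] = placeholder_quality,
) -> Tuple[str, str, float]:
    """Return (model name, translation, score) for the highest predicted quality."""
    scored = [
        (predict(extract_features(source, text)), name, text)
        for name, text in candidates.items()
    ]
    best_score, best_name, best_text = max(scored, key=lambda item: item[0])
    return best_name, best_text, best_score
```

Calling `select_translation(source, {"model_a": "...", "model_b": "..."})` returns the candidate with the highest predicted score. Because `predict` is passed in as a parameter, a trained regressor's prediction function could replace the placeholder without changing the selection logic.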

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
