IIIT Hyderabad Publications
Using Domain Knowledge to Improve Machine Translation in Indian Languages

Author: Akshat Gahoi 2018114012
Date: 2023-06-27
Report no: IIIT/TH/2023/93
Advisor: Dipti Misra Sharma

Abstract

Increased human mobility has made encountering a foreign language a common challenge for people around the world. The resulting language barrier makes everyday communication difficult, and machine translation (MT) is one facility that helps people overcome it. Research on machine translation has been going on for many decades, and many MT models give good-quality translations, but even the best of these models fail to produce quality output when given domain-specific input. These models are trained on large general-domain data, so their domain-specific translations are not up to par. This raises the problem of domain adaptation for different areas. For Indian languages, the problem is compounded by the lack of domain-specific data and of good baseline models. This thesis puts forward an approach to improving the scores of domain-specific translations through efficient use of domain data.

Before turning to domain adaptation, the thesis examines how a domain is defined and how domain information is stored in documents or sentences. For this study, we discuss two tasks that help us understand the importance of domain terminology. The first is fine-grained domain classification: by classifying an unknown document into one of a set of similar domains, it tries to extract information from similar domains and identify what makes them different. The second finds domain terms in a document in an unsupervised manner, using an improved TextRank approach in which n-grams are used to obtain the most important terms in a document.
Both of these tasks helped in understanding domain terms and their importance in defining a domain.

After establishing this understanding of domains, we detail different approaches to domain adaptation and give a comparative analysis of them. We started with a very basic domain adaptation approach that gave good results but proved inefficient for multiple domains. All the approaches were applied to the English-Hindi language pair, and the basic domain adaptation of individual domains was also carried out for the English-Telugu and English-Bengali language pairs. After multiple experiments, we show how better performance can be obtained efficiently by using the domain knowledge of different domains in the task of domain adaptation. Different approaches are discussed for all the domains to create different translation models for this task.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.