IIIT Hyderabad Publications
Title: Towards Domain Adaptation for Hindi - Telugu Machine Translation
Author: Hema Ala
Date: 2022-05-28
Report no: IIIT/TH/2022/58
Advisor: Dipti Misra Sharma

Abstract

Since Neural Machine Translation (NMT) performs better than traditional Statistical Machine Translation (SMT), it has become prevalent in recent years. However, NMT systems need a large amount of training data and perform poorly relative to Phrase-Based Machine Translation (PBMT) systems in low-resource and domain-adaptation scenarios. Domain adaptation is one of the open challenges in NMT. It becomes even harder for low-resource Indic languages and for technical domains such as Artificial Intelligence (AI) and Chemistry, since these domains contain many technical terms, equations, and so on. In a typical domain-adaptation scenario like ours, a large amount of out-of-domain bilingual training data is available on which an NMT model can be trained; we treat this as the general model. Given only a small additional amount of in-domain data, the challenge is to improve translation performance on the new domain.

We present two kinds of domain-adaptation approaches in this thesis, data-based and model-based, each with its own advantages and disadvantages. Our experiments show that the data-based approach performs significantly better than the model-based approach in our setting. Although the model-based approach yields lower translation performance on domain data than the data-based approach, it still outperforms the general model. Based on this, we make two critical points: (1) if a large amount of general corpus is available, we can choose the data-based approach, where a new model is trained on the combined data (general-domain plus in-domain data); (2) if no such parallel data exists but a general model with its configuration is available, we can choose the model-based approach, which requires no general data.
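The two strategies above can be contrasted in a minimal sketch. This is not the thesis code: training is stubbed out with a function that merely records which sentence pairs a "model" has seen, and all data names are invented for illustration. The point is where the in-domain data enters: the data-based approach trains a fresh model on the concatenated corpora, while the model-based approach starts from the general model's parameters and continues training on the in-domain data alone.

```python
# Illustrative stub (not a real NMT trainer): a "model" simply remembers
# which parallel sentence pairs it was trained on.

def train(parallel_data, init_model=None):
    """Train from scratch, or continue from an existing model's state."""
    seen = list(init_model["seen"]) if init_model else []
    return {"seen": seen + list(parallel_data)}

# Hypothetical corpora: a large general corpus and a tiny in-domain one.
general_data = [(f"hi-sent-{i}", f"te-sent-{i}") for i in range(1000)]
domain_data = [(f"hi-chem-{i}", f"te-chem-{i}") for i in range(50)]

# Data-based approach: train a new model on the combined data.
data_based_model = train(general_data + domain_data)

# Model-based approach: reuse the general model and continue training
# on the in-domain data only -- no general parallel data needed here.
general_model = train(general_data)
model_based_model = train(domain_data, init_model=general_model)

assert len(data_based_model["seen"]) == 1050
assert len(model_based_model["seen"]) == 1050
```

The practical difference the abstract describes follows from the last two lines: the adaptation step of the model-based approach touches only the 50 in-domain pairs, so it remains usable when the general parallel corpus itself is unavailable.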
Instead, it uses the available general model and continues learning with the new domain data. Since general parallel data was available to us and the data-based approach outperformed the model-based one, we carried out our further experiments with the data-based approach. The domains in our work are Chemistry and Artificial Intelligence, and the language pair is Hindi-Telugu. Since these are technical domains containing many domain-specific terms and equations, the translation of domain terms needs to be improved. We therefore used a domain dictionary as parallel data and trained on it along with the training data. We created a Hindi-Telugu domain dictionary for the Chemistry domain: using an automatic domain-term extraction algorithm, we first extracted domain terms in English and then manually translated them into Hindi and Telugu. Using the domain dictionary as parallel data improved translation performance significantly, both on domain terms and in BLEU. The domains we adopted are similar in terms of vocabulary overlap; hence, we used a combination of domains to improve translation. Which domain data is added also affects performance. We also showed how powerful general data is for completing the translation of a domain-specific text, and how the order in which the domain data sets are combined impacts the overall translation. In this work, we concentrated mainly on the data-based approach, where the available tiny amount of domain data is combined with a large amount of general data. Combining only a small amount of domain data already yields good performance, but it can be improved further with back translation. Trivial back translation does not help here, because the noisy synthetic data may prevent the model from producing proper domain-specific translations. We therefore need a practical approach that deals with this domain-specific data.
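For context, the "trivial" back translation that the abstract argues is insufficient can be sketched as follows. This is a generic illustration under invented names, not the thesis pipeline: a reverse-direction (target-to-source) model translates target-side monolingual text into synthetic source sentences, and the synthetic pairs are simply concatenated with the real parallel data. The thesis's Domain-Specific Back Translation algorithm addresses the noise this naive concatenation introduces; its mechanics are detailed in the full thesis, not here.

```python
# Sketch of vanilla back translation (assumed setup, not the thesis code).

def reverse_translate(te_sentence):
    """Stand-in for a trained Telugu->Hindi model; here just a marker."""
    return "hi-synthetic:" + te_sentence

# Hypothetical target-side monolingual in-domain text (Telugu).
monolingual_telugu = ["te-chem-a", "te-chem-b"]

# Back-translate it to obtain synthetic (Hindi, Telugu) pairs.
synthetic_pairs = [(reverse_translate(te), te) for te in monolingual_telugu]

# Naive augmentation: mix synthetic pairs with the real parallel data.
real_pairs = [("hi-chem-1", "te-chem-1")]
augmented = real_pairs + synthetic_pairs

assert len(augmented) == 3
assert augmented[1][0].startswith("hi-synthetic:")
```

Because the synthetic source side inherits whatever errors the reverse model makes on domain-specific terminology, mixing it in unfiltered can dilute exactly the domain signal the adaptation is trying to strengthen, which is the problem the abstract's final paragraph takes up.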
We address this issue with an algorithm called "Domain-Specific Back Translation." Using this algorithm, we achieved significant improvements in BLEU scores.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.