Multilingual NMT for Indian Languages

Author: Sourav Kumar
Date: 2021-07-29
Report no: IIIT/TH/2021/80
Advisor: Dipti Misra Sharma

Abstract

Natural Language Processing (NLP) is the sub-field of Artificial Intelligence concerned with enabling computers to understand and produce human text and speech. Machine Translation (MT) is an application of NLP that focuses on automatic translation between languages. It is important because it helps people all over the world travel to other countries and interact with each other. With recent advances in deep learning, Neural Machine Translation (NMT) has produced translations that are indistinguishable from human translations for many language pairs. The driving factor behind these leaps in translation quality, however, is the availability of abundant parallel data, which low-resource languages such as the Indian languages lack. A lot of research has been done to improve translation quality for low-resource languages by exploiting monolingual data or parallel data involving other language pairs. Recently, multilingualism has drawn much attention and is gradually becoming ubiquitous, as more and more researchers have shown that using additional languages helps improve translation quality. Indian languages are diverse, morphologically rich, and written in different scripts, which makes translation tasks complex and challenging. Despite this, Indian languages still share many lexical features, which we think can be utilized to improve the quality of translation systems. We therefore performed a large-scale case study of similarity among these languages, considering the different factors that may affect it. We also propose techniques for efficient Multilingual Neural Machine Translation (MNMT), particularly for Indian languages, focusing mainly on leveraging the lexical similarity of languages and on training that is efficient in both time and computational resources.
To this end, we perform a systematic, incremental case study on MNMT in which we investigate the contribution of different languages during the learning process. Based on the findings and conclusions of our study, we devised an algorithm that selects the subset of languages required for a particular multilingual NMT setting. In addition, a language can itself span multiple domains, which makes the parallel data available for any particular domain very limited. To address this data scarcity, we propose a pipeline for efficient multilingual multi-domain systems. To the best of our knowledge, this is the first large-scale study specifically devoted to improving MNMT for Indian languages by utilizing language relatedness.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
