Towards Dialogue Modelling in Code-Mixed Low Resource Language Settings

Author: Sai Vivek Nanduri 201425117
Date: 2022-06-27
Report no: IIIT/TH/2022/88
Advisor: Radhika Mamidi

Abstract

Natural Language Processing (NLP) has advanced rapidly over the last decade, driven by the success of Machine Learning, and Deep Learning techniques in particular, in both commercial and academic applications. Most of this work has centred on resource-rich languages such as English, which serve as the base upon which models are built. This leaves a large gap for a world divided into a multitude of regional cultures and languages that have not kept pace, and that gap is a very real obstacle. We aim to help bridge it by creating a rich resource for one such low-resource language: Telugu, a Dravidian language. We focus on curating and annotating a dialogue corpus because it lets us approach a variety of tasks and has many real-world applications, including the timely problem of improving Human-Computer Interaction for low-resource languages such as Telugu. Language and the means of communication also evolve over time, and one consequence is the increased use of code-mixing in colloquial discourse. To develop robust models and accurately recreate Human-Human interactions with computers, we need to be able to work with and model code-mixed data. Another focus of our study is therefore the creation of a dialogue corpus that is heavily code-mixed between Telugu and English, and the subsequent modelling of a dialogue system using deep learning techniques.
To assess the usefulness of the Telugu-English code-mixed dialogue corpus that we have carefully curated as a resource, we evaluate it on two widely studied NLP tasks: Text-based Speaker Identification and Automatic Humour Recognition. We provide a comprehensive analysis and survey of modern deep learning techniques for text-based speaker identification and explain in detail the choices made in this study. The deep learning techniques implemented in this work include Convolutional Neural Networks, LSTMs, and Transformer-based models.

Full thesis: pdf

Language Technologies Research Centre

IIIT Hyderabad Publications
