Towards Understanding Code-Mixed Telugu-English Data

Author: jittadivya.sai
Date: 2018-05-12
Report no: IIIT/TH/2018/21
Advisor: Radhika Mamidi

Abstract

Code-Mixing (CM), a progeny of multilingualism, is the phenomenon in which linguistic units such as phrases, words and morphemes of one language are embedded within an utterance of another language. It is often observed in conversations among bilingual and multilingual speakers, and code-mixed language is used extensively on social media sites like Facebook and Twitter. Although code-mixing is the most natural form of conversation in both speech and text (online chats), current dialog systems and search engines cannot handle this kind of social interaction. Many popular virtual personal assistants, used by a significant number of smartphone users, are unable to handle the mixing and switching between two languages that comes naturally to any bilingual or multilingual person. It is therefore important to understand CM, both for information extraction and for building dialog systems that are capable of social interaction. In this thesis we take the first step towards understanding CM between two languages, Telugu and English, which belong to different language families and share no ancestry. We start by building basic preprocessing tools for CM data, such as Language Identification (LID) models and part-of-speech (POS) taggers. We collected Telugu-English code-mixed data from social media blogs. The collected data was annotated by two independent annotators, and we report a Kappa score of 88.91, which indicates reliable annotation. We experimented with various machine learning algorithms and present a Multilayer Perceptron (neural network) LID model that uses character n-gram vectors augmented with other handcrafted features. Our system gives an F1-score of 97 for English and 96 for Telugu. The LID module was then used for the task of POS tagging Telugu-English CM data, which produced an accuracy of 52.37%.
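As an illustration of the character n-gram features mentioned above, the following is a minimal sketch and not the thesis implementation. The function names, the boundary markers `^` and `$`, the n-gram range, and the two extra features (word length, presence of a digit) are all assumptions chosen for the example; the thesis does not specify them here.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Return the character n-grams of a word, padded with boundary markers.

    The markers '^' and '$' are an assumption; they let the model see
    word-initial and word-final character patterns.
    """
    padded = f"^{word.lower()}$"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def build_vocab(words, n_min=1, n_max=3):
    """Map every n-gram seen in the given words to a vector index."""
    vocab = {}
    for w in words:
        for g in char_ngrams(w, n_min, n_max):
            vocab.setdefault(g, len(vocab))
    return vocab


def vectorize(word, vocab, n_min=1, n_max=3):
    """Turn a word into n-gram counts plus two hypothetical extra features."""
    counts = Counter(char_ngrams(word, n_min, n_max))
    vec = [0] * len(vocab)
    for gram, count in counts.items():
        idx = vocab.get(gram)
        if idx is not None:  # n-grams outside the vocabulary are ignored
            vec[idx] = count
    # Assumed handcrafted features, for illustration only.
    vec.append(len(word))
    vec.append(int(any(ch.isdigit() for ch in word)))
    return vec


# Toy words: one romanized Telugu word and one English word.
vocab = build_vocab(["nenu", "movie"])
features = vectorize("nenu", vocab)
```

Vectors of this kind could then be passed to any classifier, such as a multilayer perceptron, to predict a language label for each word.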
In the second part of the thesis we attempt to understand code-mixing in dialog (the richest and most natural form of language) through automatic recognition of the dialog act (the speaker’s intention) of an utterance. We experimented with learning algorithms such as Support Vector Machines, Naive Bayes, K-Nearest Neighbors and Hidden Markov Models (HMMs). Our best system, which is HMM-based, gives an F1-score of 72.30. The lack of annotated conversational code-mixed data hinders us from adopting data-driven methods like neural networks, and manual collection and annotation of code-mixed data is both time-consuming and labor-intensive. Therefore, we also investigate how knowledge extracted from resource-rich languages can be useful in dialog act recognition for code-mixed conversations. We show that an accuracy of 66% can be obtained by transforming the code-mixed utterances into a sequence of English words using modules such as LID, transliteration and translation. We propose that, in the future, a variant of transfer learning can be used to combine this knowledge with the knowledge obtained from manually annotated code-mixed conversations.
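The transformation of a code-mixed utterance into a sequence of English words can be sketched roughly as follows. This is a hedged illustration only: the function `to_english`, the language tags `"te"` and `"en"`, and the toy dictionaries standing in for the LID, transliteration and translation modules are hypothetical and are not taken from the thesis.

```python
def to_english(tokens, identify, transliterate, translate):
    """Convert a code-mixed token sequence into English words.

    identify, transliterate and translate are callables supplied by the
    caller; here they stand in for the LID, transliteration and
    translation modules named in the abstract.
    """
    english = []
    for tok in tokens:
        if identify(tok) == "te":
            # Romanized Telugu: restore native script, then translate.
            english.append(translate(transliterate(tok)))
        else:
            # English tokens pass through unchanged.
            english.append(tok)
    return english


# Toy stand-ins (assumptions): one-entry lookup tables, not real models.
TOY_TRANSLIT = {"nenu": "నేను"}
TOY_TRANSLATE = {"నేను": "I"}


def toy_identify(tok):
    return "te" if tok.lower() in TOY_TRANSLIT else "en"


def toy_transliterate(tok):
    return TOY_TRANSLIT[tok.lower()]


def toy_translate(word):
    return TOY_TRANSLATE.get(word, word)


result = to_english(["nenu", "ready"], toy_identify, toy_transliterate, toy_translate)
```

Under this sketch, the resulting English word sequence could be fed to a dialog act classifier trained on English data, which is one way to reuse resources from a resource-rich language.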

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
