Applications and Resources for Telugu Code-mixing

Author: SRIRANGAM VAMSHI KRISHNA
Date: 2020-07-27
Report no: IIIT/TH/2020/58
Advisor: Manish Shrivastava

Abstract

The recent surge in data and advances in machine learning and deep learning have improved our understanding of language and led to many applications. The same is not true for many under-resourced languages, and Telugu is one such low-resource Indian language. This work aims to advance research on the Telugu language. We present three corpora and their analysis in the areas of conversational dialogue systems, Named Entity Recognition (NER) and emotion prediction. The first is a Telugu conversational corpus, the first such corpus to the best of our knowledge. We built an end-to-end dialogue system using the corpus and performed experiments with a sequence-to-sequence model with an encoder and an attention decoder, varying word order, translation, vocabulary size, transliteration and word representations. The second is a Telugu-English code-mixed social media corpus for NER, the first such corpus to the best of our knowledge. We experimented with traditional machine learning methods, namely Conditional Random Fields (CRFs), which are undirected probabilistic graphical models, and Decision Trees, as well as a deep learning method, Long Short-Term Memory networks (LSTMs). We proposed feature functions for NER, which were used in the CRF. We report F1-scores of 0.96, 0.94 and 0.95 with CRFs, Decision Trees and Bidirectional LSTMs respectively. The third is a Telugu-English code-mixed social media corpus for emotion prediction, the first such corpus to the best of our knowledge. We proposed feature functions for emotion prediction, which were used in an experiment with Support Vector Machines (SVMs). We also experimented with deep learning methods, namely LSTMs and Bidirectional LSTMs. SVMs, LSTMs and Bidirectional LSTMs achieved accuracies of 58%, 60.92% and 70.74% respectively.
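
As an illustration of what token-level feature functions for NER on code-mixed social media text might look like, here is a minimal Python sketch. The abstract does not list the thesis's actual features, so every feature name below, the helper names `token_features` and `sentence_features`, and the romanized example sentence are assumptions made for illustration only.

```python
def token_features(tokens, i):
    """Return a feature dict for tokens[i].

    All feature names are hypothetical; they are common choices for
    social media NER, not the feature set proposed in the thesis.
    """
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        # Character affixes help with romanized Telugu, where spelling varies.
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        # One token of context on each side, with boundary markers.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Build one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up code-mixed example sentence, not taken from the corpus.
example = ["Nenu", "Hyderabad", "lo", "unna", "@ravi", "#IPL2020"]
features = sentence_features(example)
```

A list of per-token dictionaries like this is one common input shape for CRF toolkits; the same dictionaries could also be vectorized for a Decision Tree classifier.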

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
