Enhancing Sentiment Prediction and Bias Detection for Telugu Language across Multiple Domains using ML and Deep Learning

Author: Gangula Rama Rohit Reddy
Date: 2020-05-07
Report no: IIIT/TH/2020/30
Advisor: Radhika Mamidi

Abstract

The rapid increase of textual content on the internet has made efficient text processing very important for applications ranging from product analysis and implementing business strategies to incorporating public opinion into governance processes or creating biased news to gain public support. In this context, Sentiment Analysis has become an important field of research. Sentiment analysis uses NLP and machine learning techniques to learn about public opinions expressed on various platforms. A lot of research has been done in these fields for English, owing to the availability of a large number of tools and resources. Though significant progress has been made for Indian languages in recent years, much remains to be explored for Telugu with machine learning and deep learning techniques. This thesis aims to enhance sentiment prediction and the detection of political bias that is intended to alter public sentiment, using various approaches. For this, it was first essential to develop a corpus. For the development of sentiment classifiers in Telugu, we created the “Sentiraama” corpora for different domains, namely movie reviews, song lyrics, product reviews and book reviews, with the text written in Telugu script, as no such datasets were previously available. We describe the process of creating the corpora and assigning polarities to them. Typically, a sentiment classifier is trained using data from the same domain it is intended to be tested on. But sufficient data may not be available in that domain, and using data from multiple sources and domains may help in creating a more generalized sentiment classifier that enhances sentiment prediction. We show how this generalized sentiment classifier can help in handling scenarios with little domain-specific data.
To improve sentiment analysis in a low-resource language, sentiment-labeled corpora can be translated from English into the focus language and used as additional resources for sentiment analysis research in that language. But when text is translated from one language into another, the sentiment is preserved to varying degrees. We use product and book reviews in English as a stand-in for source-language text. To determine the loss in sentiment and in sentiment predictability, we use manually and automatically determined sentiment labels of the English text as a benchmark. We show that sentiment analysis of manual Telugu translations of English text produces results competitive with English sentiment analysis. We also show that machine translation significantly reduces the human ability to recover sentiment. In the process, we created a Telugu-English parallel corpus that is independently annotated for sentiment on a 5-value scale by Telugu and English speakers. We also created a Telugu lexicon annotated at both the sentiment and emphasis levels. Song lyrics are another kind of text, different from the reviews described above. Songs are important to sentiment analysis since songs and mood are mutually dependent. Song lyrics are a rich source of data for analyzing and classifying the sentiments they evoke. Nowadays we observe a lot of inter-sentential and intra-sentential code-mixing in songs, which has a varying impact on the audience. To study this impact, we created a Telugu songs dataset containing both Telugu-English code-mixed and pure Telugu songs. We classified the sentiment of the songs based on their arousal rather than the routine valence-level analysis, and enhanced sentiment prediction by introducing code-mixing features.
On the other hand, in the domain of news, media houses and journalists try to affect the sentiment or polarity of people towards certain political parties by shrewd means such as misrepresenting reality and distorting viewpoints. This is mainly observed during elections, when media houses try to alter people’s sentiment by presenting biased news. Detecting such biased news is very useful, as unrestricted access to unbiased information is crucial for forming a well-balanced understanding. So, along with sentiment analysis, we also aimed at detecting biased news that is intended to cause sentiment shifts. As no dataset existed for Telugu, we created a dataset comprising 1329 news articles collected from various Telugu newspapers and marked them for bias towards a political party. We also propose a headline attention model which improves bias detection over various methods by a substantial margin.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
