Sentiment Analysis for Telugu Language

Author: Muku Sandeep
Date: 2017-12-21
Report no: IIIT/TH/2017/94
Advisor: Radhika Mamidi

Abstract

Sentiment analysis has recently become a critical component of many Natural Language Processing applications, such as recommendation systems, question answering, and business intelligence products. At its core, sentiment analysis is the process of analyzing the emotions expressed in a given piece of text, where the text is typically classified as carrying either positive or negative polarity. While sentiment analysis in English has been studied extensively, the task has hardly been explored in Telugu. In this thesis, we address the fundamentals of sentiment analysis for Telugu, the language popularly known as the Italian of the East. While sentiment analysis in itself poses considerable challenges, an additional problem in Telugu, one that has held back active research, is the lack of appropriate data. We therefore address data availability first, so that the sentiment analysis problem itself can be tackled more effectively. Using a set of well-defined annotation guidelines, we first manually annotate Telugu texts with a positive, negative, or neutral tag. The annotated data is then used to carry out the sentiment analysis task, where a variety of machine learning algorithms are employed to handle both binary (positive and negative tags only) and ternary (positive, neutral, and negative tags) classification problems. Our experiments reveal that while a somewhat more complex classifier such as Random Forests performs best on the binary classification task, a simple classifier such as logistic regression performs best when the task is framed as a ternary classification problem. Learning algorithms generally improve as the amount of training data increases. However, manually annotating data is difficult and expensive in practice.
To address these issues, we employ a novel Hybrid Query Selection Strategy to increase the amount of annotated data, in which a set of classifiers is used to annotate data. Experiments show that our approach achieves this with a minimal error rate. We also use active learning to classify sentences according to their polarities, thereby achieving our end goal.

Full thesis: pdf

Language Technologies Research Centre

IIIT Hyderabad Publications
