Sentiment Analysis For Hindi Language

Author: Piyush Arora
Date: 2013-07-23
Report no: IIIT/TH/2013/32
Advisor: Vasudeva Varma

Abstract

Sentiment analysis has been an area of active research over the last decade. The growth of user-generated content gives researchers, industry, and governments an important source of information to mine. Organizations use this content to identify the general sentiment that different users express about a product. In this work, we focus on mining and analyzing sentiment in the Hindi language. Hindi is the fourth most commonly spoken language in the world. As more information is communicated in regional languages such as Hindi, mining it becomes a promising opportunity. Mining sentiment in Hindi comes with its own issues and challenges. Compared to English, Hindi is morphologically rich and has free word order, which adds complexity when handling user-generated content. The scarcity of resources for Hindi brings further challenges, starting with the collection and generation of datasets. We take up this challenge and work towards building resources for Hindi: annotated corpora of reviews and blogs, and a subjective lexicon.

We propose a technique to build a subjective lexicon from a pre-annotated seed list for a language and its wordnet, which represents the connectivity of words through synonym and antonym relations. A salient feature of the technique is that it can be applied to any language for which a wordnet is available. To show its applicability to other languages, we apply it to English in addition to Hindi. The lexicon generated by our algorithm is evaluated using three metrics:

1. Comparison against manual annotation
2. Comparison against similar existing resources
3. Classification performance

In addition to resource creation, we take up the task of sentiment classification in Hindi. We work on two genres of Hindi user-generated web content:

1. Reviews
2. Blogs

For both genres we present three approaches to sentiment classification:

1. Using a subjective lexicon
2. N-gram method
3. Weighted n-gram

We analyze the merits and demerits of each approach across the two genres. We discuss in detail the problems and issues that arise when working with user-generated content (reviews and blogs) in Hindi. This work sheds light on the main differences between user-generated content in English and in Hindi, both at the linguistic level and at the level of representation, and on the approaches followed to address them. English offers abundant resources and tools developed in the past, whereas work on Indian languages began only in the last decade and is still at an early stage of research and development. Our focus has therefore been: “To effectively mine the subjective information from the user-generated content in Indian languages, overcoming the data scarcity challenges associated with such problems.”
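The abstract does not spell out the lexicon-building algorithm, so the following is only a minimal sketch of the general idea it describes, under stated assumptions: polarity labels spread breadth-first from a seed list over a wordnet-like graph, synonyms inherit the label, and antonyms receive the opposite label. The function names, the toy graph, and the romanized Hindi words are all hypothetical illustrations and are not taken from the thesis. A simple counting classifier is included to suggest how such a lexicon might feed the lexicon-based classification approach.

```python
from collections import deque

# Assumed polarity labels; the thesis may use a different scheme.
FLIP = {"positive": "negative", "negative": "positive", "neutral": "neutral"}


def expand_lexicon(seeds, synonyms, antonyms):
    """Spread seed polarities over a synonym/antonym graph, breadth-first.

    seeds:    dict mapping word -> polarity label
    synonyms: dict mapping word -> iterable of synonym words
    antonyms: dict mapping word -> iterable of antonym words
    A word keeps the first label that reaches it.
    """
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        # Synonyms inherit the polarity of the current word.
        for neighbour in synonyms.get(word, ()):
            if neighbour not in lexicon:
                lexicon[neighbour] = polarity
                queue.append(neighbour)
        # Antonyms receive the opposite polarity.
        for neighbour in antonyms.get(word, ()):
            if neighbour not in lexicon:
                lexicon[neighbour] = FLIP[polarity]
                queue.append(neighbour)
    return lexicon


def classify(tokens, lexicon):
    """Label a token list by counting positive and negative lexicon hits."""
    score = 0
    for token in tokens:
        label = lexicon.get(token)
        if label == "positive":
            score += 1
        elif label == "negative":
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Toy graph with romanized Hindi words (hypothetical data):
# accha = good, badhiya = great, bura = bad, kharab = bad/spoiled.
seeds = {"accha": "positive"}
synonyms = {"accha": ["badhiya"], "bura": ["kharab"]}
antonyms = {"accha": ["bura"]}

lexicon = expand_lexicon(seeds, synonyms, antonyms)
label = classify(["film", "badhiya", "thi"], lexicon)
```

Because only synonym and antonym links are needed, the same sketch could be run on any language with a wordnet by swapping in that wordnet's relations for the toy dictionaries.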

Full thesis: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
