IIIT Hyderabad Publications
Understanding Topics and Sentiment from Social Media
Author: Satarupa Guha
Report no: IIIT/TH/2016/70
With the ever-increasing influence of social media on our lives, it is important to automatically make sense of the enormous amount of social media data generated each day and to leverage it for different applications. A politician trying to gauge her chances of winning an election, a government reaching out to understand citizens' grievances regarding various administrative issues, a company attempting to promote its products and assess its market reach: all of them have one destination in common, social media. Even a few years ago, a restaurant interested in customer feedback would have had to run manual surveys, requesting customers to fill out lengthy feedback forms. Not only did this approach have limited reach, it also suffered from unresponsive customers. Now, with the advent of social media, this process has become much less cumbersome, with people reporting their experiences openly and unguardedly, thereby reducing the cost incurred by companies manyfold. However, it is challenging to mine information from the huge amount of unstructured, user-generated social media data. Understanding this data primarily entails identifying and analysing the topics discussed on social media, along with the opinions expressed towards those topics. In this thesis, we focus on this problem by first treating the two tasks of topic and sentiment identification independently, before eventually presenting a joint model. Topic identification in social media posts has been widely studied in the recent past. In contrast, we direct our attention towards capturing the topics in the conversations that arise out of interactions between users on social media. Specifically, given a conversation on Twitter, we aim to automatically recommend a relevant tweet, treating the conversation as a tree.
Existing solutions to this problem exploit the similarity of a candidate tweet to a single tweet, or to past tweets in a user-pair conversation. In this work, we generalize the problem setting to recommend a tweet considering the context of an entire conversation tree, which often includes tweets from multiple users. While this setting is more natural, it brings additional challenges: (1) how to choose an anchor tweet node from the conversation tree for which a new tweet can be recommended as a reply, and (2) how to choose the tweet to recommend. We learn regression models with novel features to address both challenges, and use them to perform extractive response recommendation. The first regression model predicts the time required by a tweet node to get its first child node, while the second predicts the number of retweets received by a tweet, as a measure of its popularity and acceptability, and hence quality. Experiments with millions of tweets show that the proposed recommendation method is more accurate than state-of-the-art approaches with respect to ground-truth labels. Due to the lack of manually annotated data, we use proxy signals to infer labels and treat them as ground truth. Sentiment analysis is another important task that complements topic identification in capturing actionable information from social media. While sentiment analysis has been studied widely, it is not a completely solved problem given the noisy nature of social media data. In this work, we introduce some novel features and learn supervised models to improve the performance of sentiment analysis on Twitter. With careful feature ablation experiments, we show which of the novel features contribute to the performance improvement of our system, and to what extent. The resulting model is simple, performs competitively with state-of-the-art systems, and can be quickly and easily prototyped as an end-to-end system from scratch.
However, one drawback of this work is that it does not take topics into consideration. Given a piece of text, we assign a sentiment label to it irrespective of the topics being discussed in that text. This is especially problematic when multiple topics are at play and the sentiment expressed throughout the text is not uniform. To alleviate this shortcoming, we further aim to learn topics and sentiment together, although the resulting model is still a pipeline rather than a joint model. With this model, we want to overcome the drawbacks of topic-agnostic sentiment analysis and tackle the problem formally known as Aspect Based Sentiment Analysis, in which the goal is to identify the sentiment associated with each of the aspects being discussed in a post. First, the aspects or topics being discussed in the text are identified; we treat this sub-task of aspect category detection as a multi-class, multi-label classification problem. Instead of assigning a single generic sentiment label to a piece of text, our goal here is to associate a sentiment label with each of the aspects or topics being discussed in the text. This sub-task is called sentiment polarity classification. We additionally address the problem of identifying the term in the text that represents each of the aspects or topics. This problem is called opinion target expression identification. Specifically, dealing with restaurant reviews, we show that our proposed supervised approach performs well compared to other state-of-the-art approaches for each of the sub-tasks mentioned above. The drawback of a pipeline model is that it fails to leverage the synergy between the two tasks. Also, until now, we have discussed primarily supervised approaches. So, finally, we introduce the task of joint modelling of topics and sentiment that leverages weak supervision. A huge amount of data is generated every day on social media, encompassing a wide range of topics.
With many business decisions depending on customer opinion, mining of social media data needs to be quick and easy. A data analyst who needs to keep up with the agility and scale of the data cannot depend on fully supervised techniques to mine topics and their associated sentiments from social media. Motivated by this, we propose a weakly supervised approach, named TweetGrep, that lets the data analyst easily define a topic with a few keywords and adapt a generic sentiment classifier to the topic, by jointly modelling topics and sentiment using label regularization. Experiments with diverse datasets show that TweetGrep beats state-of-the-art models on both tasks, retrieving topical tweets and analyzing the sentiment of the tweets (average improvements of 4.97% and 6.91% respectively in terms of area under the curve). This model indirectly contributes towards domain adaptation of a generic sentiment analyser. Further, we show that TweetGrep can also be adopted for a novel task of hashtag disambiguation, the goal of which is to retrieve the tweets related to the original sense of a hashtag from a set of tweets that contain the hashtag in question. Our proposed approach works significantly better than the baseline models for this task as well. In this thesis, we have attempted to target the broad problem of topic and sentiment identification from noisy user-generated social media data. We have chosen a few challenging sub-problems within this wider scope and presented robust solutions. Through extensive experiments, we have shown that our approach performs better than existing work.
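The abstract names label regularization but does not state the objective TweetGrep optimizes. As a rough illustration only, the sketch below shows label regularization in its commonly used generic form: a penalty equal to the KL divergence between an analyst-supplied prior over labels and the model's average predicted label distribution on unlabeled tweets. The function name, the `prior` argument, and the example numbers are assumptions made for this sketch and are not taken from the thesis.

```python
import math


def label_regularization_penalty(predicted_probs, prior, eps=1e-12):
    """Generic label-regularization term (hypothetical helper, not TweetGrep's code).

    predicted_probs: list of per-example probability vectors produced by some
                     classifier on unlabeled data (each sums to 1).
    prior:           expected label proportions supplied as weak supervision.
    Returns KL(prior || mean prediction); it is zero when the average
    prediction matches the prior and grows as the two diverge.
    """
    n = len(predicted_probs)
    k = len(prior)
    # Average predicted distribution over the unlabeled batch.
    mean_pred = [sum(p[j] for p in predicted_probs) / n for j in range(k)]
    # KL divergence; terms with zero prior mass contribute nothing.
    return sum(
        prior[j] * math.log(prior[j] / max(mean_pred[j], eps))
        for j in range(k)
        if prior[j] > 0
    )


# Assumed example: the classifier leans positive, the prior says 50/50.
batch = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
penalty = label_regularization_penalty(batch, prior=[0.5, 0.5])
```

In a weakly supervised setup of this kind, such a penalty would typically be added to the training loss so that predictions on unlabeled data are pulled towards the expected label proportions; how TweetGrep combines it with the topic and sentiment models is described in the full thesis, not here.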
Full thesis: pdf
Centre for Search and Information Extraction Lab