IIIT Hyderabad Publications
Towards Deep Semantic Analysis of Hashtags
Author: Piyush Bansal
Report no: IIIT/TH/2016/42
Microblogging services like Twitter enable communication at a massive scale. Twitter recently reported that 248 million users access the platform every month and create around 500 million posts every day in more than 35 languages. This tremendous growth offers an opportunity to mine useful information shared across social media. However, posts (also known as “tweets”) are limited to 140 characters. This limit results in heavy use of emoticons, abbreviations, and misspellings, and has led to various linguistic “innovations” that render traditional text analysis techniques less effective. Another interesting aspect of tweets is the use of semantico-syntactical constructs called “hashtags”. Hashtags are “#”-prefixed keywords that people use to organise the meaning of their tweets. Hashtags also enable classification of tweets, since posts using the same or similar hashtags are expected to be semantically related. The challenge is that most hashtags are not simply “#”-prefixed keywords, but a “#” symbol followed by a concatenation of words or phrases that are not space delimited. For example, consider the hashtag “#NSAvsSnowden”. This hashtag is essentially “NSA vs Snowden”, which is not a single keyword but a concatenation of several words. In this thesis, we discuss and compare various approaches to “segment” a hashtag into meaningful words. Our task also extends beyond segmentation: we present a unified framework that additionally performs “entity linking” on the constituent entities in a hashtag. Entity linking is an established IR task in which the goal is to extract latent semantics from plain text by linking the text to a knowledge base (KB) such as Wikipedia.
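To make the segmentation task concrete, the following is a minimal sketch of one common baseline: a dynamic-programming search for the most probable split under a unigram model. It is not necessarily one of the approaches compared in the thesis. The vocabulary `TOY_COUNTS`, the function name `segment_hashtag`, and the unknown-word penalty are all assumptions made for illustration.

```python
import math

# Hypothetical toy unigram counts; a real system would estimate these
# from a large corpus.
TOY_COUNTS = {"nsa": 50, "vs": 200, "snowden": 30,
              "rock": 80, "concert": 60, "concerts": 40}


def segment_hashtag(hashtag, counts=TOY_COUNTS, max_word_len=20):
    """Split a hashtag into the most probable word sequence (unigram model)."""
    text = hashtag.lstrip("#").lower()
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed penalty: unknown strings get less likely as they get longer.
        return math.log(1.0 / (total * 10 ** len(word)))

    n = len(text)
    # best[i] = (score of the best segmentation of text[:i], start of last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + log_prob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the stored split points backwards to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment_hashtag("#NSAvsSnowden"))
```

With the toy counts above, the call splits “#NSAvsSnowden” into “nsa”, “vs”, and “snowden”. Note that the sketch lowercases its input and so ignores capitalisation, which is often a useful segmentation cue in its own right.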
Consider, for example, the following text: “Snowden reveals classified information from NSA”. We first need to identify the entities in this piece of text, then “disambiguate” them, and finally establish a link between each entity and a knowledge base (KB), so that additional contextual information about the entity becomes available. This approach has been found to be instrumental in teaching machines the meaning of text that is otherwise meant for human consumption. Hence, after performing entity linking on the segmented hashtag “NSA vs Snowden”, we would have enriched the text with additional semantic information by linking “Snowden” to the corresponding Wikipedia page, “Edward Snowden”, and “NSA” to “National Security Agency”. In principle, “NSA” could also refer to “National Sports Academy” or “National Security Act”; this is exactly where “disambiguation”, which uses contextual information to resolve a mention, becomes important. Since hashtags are human-curated labels associated with tweets, our premise is that segmenting them and linking the entities they contain can help in better understanding and extracting the information shared across social media. Traditionally, most IR tasks have either treated a hashtag as a single word or ignored hashtags altogether. We demonstrate that the extraction of semantics from tweets improves when our system supplies additional semantic information by segmenting and entity-linking hashtags. We show this through experiments on the NEEL Challenge Dataset and on a human-annotated subset of the Stanford Sentiment Analysis Dataset, which has also been made public to ease future research in this area. We achieved a P@1 score of 0.914 on the NEEL Dataset and 0.873 on the manually annotated Stanford Sentiment Analysis Dataset for hashtag segmentation and linking.
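The disambiguation step described above can be illustrated with a deliberately simple sketch that picks the candidate whose description shares the most words with the surrounding context. This is a stand-in for illustration only, not the linking method used in the thesis. The mini knowledge base `TOY_KB`, its hand-written descriptions, and the function `link_mention` are hypothetical.

```python
import re

# Hypothetical mini knowledge base: mention -> {candidate title: description}.
# The descriptions are invented for illustration, not taken from Wikipedia.
TOY_KB = {
    "nsa": {
        "National Security Agency":
            "united states intelligence agency surveillance classified "
            "information snowden",
        "National Sports Academy":
            "sports training school athletes coaching",
        "National Security Act":
            "law legislation detention act parliament",
    },
}


def tokens(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def link_mention(mention, context, kb=TOY_KB):
    """Return the candidate title that best overlaps the context, or None."""
    candidates = kb.get(mention.lower(), {})
    if not candidates:
        return None
    context_tokens = tokens(context)
    scores = {title: len(context_tokens & tokens(description))
              for title, description in candidates.items()}
    # On a tie, max() keeps the first candidate in insertion order.
    return max(scores, key=scores.get)


print(link_mention("NSA", "Snowden reveals classified information from NSA"))
```

For the example sentence, the context words “snowden”, “classified”, and “information” overlap only with the first candidate, so the sketch selects “National Security Agency”. A realistic linker would draw candidates and context statistics from the knowledge base itself rather than from hand-written descriptions.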
We also show how our approach leads to improvements in the tasks of “Semantic Microblog retrieval” and “Semantic Hashtag retrieval”. Microblog retrieval refers to retrieving a ranked list of microposts given a query Q. Hashtag retrieval, on the other hand, is a relatively newer IR task. It refers to retrieving a ranked list of the top-k hashtags relevant to a user’s query Q. For a user interested in, for instance, “Rock concerts”, it would be very helpful to be offered a list of hashtags that are commonly used in relation to “Rock concerts”. By tracking these hashtags, the user can gain information about rock concerts from the posted tweets. However, it is not possible for a user to manually discover all the hashtags used across Twitter that are relevant to their interest. In this thesis, we also address this problem. To solve these two retrieval problems, we propose and discuss a virtual document structure, which we refer to as the Semantically Enriched Microblog Document (SEMD), and experiment with various pseudo-blind relevance feedback mechanisms. We tested our approach on the publicly available Stanford sentiment analysis tweet corpus. We observed an improvement of more than 10% in NDCG for the microblog retrieval task, and of around 11% in mean average precision for the hashtag retrieval task. We also experiment with some interesting aspects of search engine evaluation in the context of our tasks. Our work presents an elaborate discussion of various aspects of hashtag analysis. We hope that, by drawing attention to hashtags, it drives fruitful research on various microblog IR tasks in the future.
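For readers unfamiliar with the two ranking metrics mentioned above, the following sketch shows one standard way to compute them. It assumes the linear-gain form of DCG, computes the ideal ranking by reordering the judged list itself, and assumes that every relevant item appears somewhere in the ranked list when computing average precision. The exact evaluation setup of the thesis may differ, and the function names are chosen for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain: rel / log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the same judgements in ideal order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def average_precision(is_relevant):
    """Mean of precision@k over the ranks k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(is_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(run) for run in runs) / len(runs) if runs else 0.0


# One query with graded judgements, and two queries with binary judgements.
print(ndcg([3, 2, 1]))
print(mean_average_precision([[True, False, True], [False, True]]))
```

A ranking that already lists items in decreasing order of relevance gets an NDCG of 1, and pushing relevant items down the list lowers both metrics.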
Centre for Search and Information Extraction Lab