Predicting User Attributes from Social Media

Author: Santosh Kosgi
Date: 2017-03-02
Report no: IIIT/TH/2017/17
Advisor: Vasudeva Varma

Abstract

For privacy reasons, personally identifiable information such as the age and gender of people is not publicly available. However, accurate prediction of such information has important applications in advertising, forensics and business intelligence. For instance, from a forensic linguistics perspective, being able to determine the linguistic profile of the author of a suspicious text solely by analyzing the text could be extremely valuable for evaluating suspects. Similarly, from a marketing viewpoint, companies may want to know, from the analysis of blogs and online product reviews, what types of people like or dislike their products. In this thesis we devise methods to predict the age, gender and profession of authors from social media. First, we consider the problem of predicting the age and gender of blog authors. We exploit differences in writing style and content between male and female bloggers, as well as among authors of different ages, to determine an unknown author's age and gender. Word n-grams are used as content-based features and POS n-grams as style-based features. N-gram features model the top words used by different categories of people, but the same word is often used in different contexts. For example, male authors use the word "dresses" alongside words like "trousers" and "coats", whereas female authors use it alongside words like "bridal wear" and "gowns". To capture such differences, we also use topic-based features, which reflect the fact that different categories of people have different topics of interest. We use the LDA [9] algorithm to find topics in the blogs and propose a method that combines the above-mentioned features to predict an unknown author's age and gender solely by analyzing the contents of the blogs.
Existing methods for predicting the age and gender of blog authors have focused on classifier learning using content-based features like word n-grams and style-based features like Part of Speech (POS) n-grams. Two major drawbacks of previous approaches are: (1) they do not consider the semantic relation between words, and (2) they do not handle polysemy. In this thesis, we propose a novel method to address these drawbacks by representing the document using Wikipedia concepts and category information. Experimental results show that classifiers learned using such features achieve significantly better accuracy than the state-of-the-art methods. Indeed, feature selection shows that our novel features are more effective than previously used content-based features.
In recent years, microblogging services like Twitter have become major tools for sharing events, expressing opinions and communicating with friends. Several thousand microblogs are posted each second describing ongoing events around the world. Twitter has a large number of users of varied ages, genders and professions. Because of this recent rise in the popularity and size of social media, there is a growing need for systems that can extract useful user information from social media. People in advertising and marketing try to find certain characteristics of users to target. Thus, finding characteristics (attributes) of Twitter users is of high importance. Unlike other social networking sites such as Facebook and MySpace, Twitter has limited information about its users, which makes the task difficult. In response, this work sets out to automatically predict the age, gender and profession of Twitter users. Existing methods for this problem have focused on classifier learning using user-related features such as socio-linguistic, content-based and network-structure-based features. Most current systems do not consider the effect of these attributes on how other users interact with a user. In this work, we exploit the mention tweets and list membership information of Twitter users, along with the tweet content, to predict age, gender and profession. Our experiments demonstrate high overall accuracy in detecting age, gender and profession, and analyse the pros and cons of various types of features for each user attribute.
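
As an illustration of the word n-gram content features mentioned in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes a simple regular-expression tokenizer, and the function names (word_ngrams, top_ngrams_by_class) and any labels passed in are hypothetical choices made only for this example. It counts word n-grams per document and lists the most frequent ones for each author category.

```python
import re
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams (as tuples of lower-cased tokens) in a text.

    The tokenizer is an assumption: runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n shifted copies of the token list yields consecutive n-grams
    return Counter(zip(*(tokens[i:] for i in range(n))))


def top_ngrams_by_class(docs, n=2, k=3):
    """Return the k most frequent n-grams for each label.

    docs is an iterable of (label, text) pairs; labels are hypothetical
    author categories such as an age group or a gender.
    """
    totals = {}
    for label, text in docs:
        totals.setdefault(label, Counter()).update(word_ngrams(text, n))
    return {
        label: [gram for gram, _ in counts.most_common(k)]
        for label, counts in totals.items()
    }
```

Counts like these could serve as inputs to a classifier. The style-based (POS n-gram) and topic-based (LDA) features described above would need a part-of-speech tagger and a topic model, which this sketch does not cover.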

Full thesis: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
