Towards Identifying Humor and Author’s Gender in Code-mixed Social Media Content

Author: Ankush Khandelwal
Date: 2019-05-02
Report no: IIIT/TH/2019/32
Advisor: Manish Shrivastava

Abstract

In bilingual and multilingual communities, people often use a fused lect or mixed language when communicating with each other. This intermixing of languages is called code-mixing, and it is emerging as a new language variety. Such mixtures even have informal names: the mixture of Hindi and English is called Hinglish, and the mixture of French and English is called Franglais. Social networking platforms like Twitter and Facebook give individuals a stage to express their perspectives and share their thoughts openly across the globe. On these platforms, communication takes place via short texts, often using multiple languages in a single text. The past decade has seen an exponential growth in social media users, which has given rise to an immense amount of user-generated data in the form of Twitter and Facebook posts. A considerable amount of this data consists of code-mixed text, as users from bilingual and multilingual communities often type the same way they speak in an actual conversation. Texts of this kind make text classification a challenging task in natural language processing. Hence, there is a need to study the computational aspects of processing code-mixed text, but the scarcity of annotated datasets makes such studies difficult to carry out. In this thesis, we study Hindi-English code-mixing in Twitter posts by addressing two automatic text classification problems. The first is identifying humor in code-mixed texts. Humor detection is one of the difficult tasks in computational linguistics: detecting humor requires an in-depth semantic understanding of the text, which makes the problem difficult to automate. Humor detection has been studied widely on monolingual texts, but resources for code-mixed texts are lacking.
The second problem is detecting the gender of a tweet's author, which is a part of author profiling. Author profiling is the problem of automatically determining attributes of an author, such as gender and age group, from a text. It is gaining popularity in computational linguistics because, in the present digital world, it is quite easy to create a fake profile with a false name, gender, age and location, which can be harmful to users as well as to the community as a whole. In this work, we present for the first time a corpus of Hindi-English code-mixed tweets annotated with the author's gender and the presence of humor in the tweet. The corpus contains 7558 code-mixed tweets in Hindi and English, and we describe the tweet annotation process in detail. Furthermore, we develop a baseline classification system for the two problems discussed above. It is based on a classification model that was tested on a dataset of about 1.5 million Spanish election tweets, in which each tweet is to be classified into one of four predefined classes. Our system uses character-level and word-level features with a Kernel SVM as the classifier, and we perform 10-fold cross-validation to evaluate it.
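The abstract does not give the feature definitions or the evaluation code, so the following is only a minimal sketch of what character-level and word-level n-gram features and a 10-fold split could look like. The function names, the n-gram ranges and the round-robin fold assignment are all assumptions made for illustration, not the thesis's implementation. The Kernel SVM itself is omitted; in practice the resulting feature counts would be passed to an SVM library.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max (range is an assumption)."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=2):
    """Count whitespace-token n-grams of length n_min..n_max (range is an assumption)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def extract_features(text):
    """Merge both feature families; the "c:" / "w:" prefixes keep them distinct."""
    return char_ngrams(text) + word_ngrams(text)


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once.

    Folds are assigned round-robin with no shuffling or stratification,
    which is a simplification for this sketch.
    """
    indices = list(range(n_samples))
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


# Hypothetical code-mixed example sentence, not taken from the corpus.
features = extract_features("yeh movie bahut funny thi")
folds = list(k_fold_indices(7558, k=10))
```

With 10 folds, each tweet is used for testing once and for training nine times, and the reported score would be the average over the ten held-out folds.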

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
