IIIT Hyderabad Publications
Towards understanding People from Multilingual Societies

Author: Deepanshu Vijay
Date: 2018-09-26
Report no: IIIT/TH/2018/68
Advisor: Manish Shrivastava

Abstract

Micro-blogging sites such as Twitter and Facebook have gained tremendous popularity in the last decade, allowing users to freely write content on the Internet in order to share, provide, and use information. They encourage users to express their daily thoughts in real time, which often results in millions of emotional statements being posted online every day. This huge amount of user-generated data has given rise to a research community that mines social media content and studies tasks tied to the linguistic properties of the text, such as emotion prediction, irony detection, hate speech detection, gender prediction, sarcasm detection, fake news detection, and troll detection. However, the writing style of users on microblogging sites tends to be colloquial and non-standard, unlike the style found in more traditional, edited genres. Authors from multilingual societies frequently write code-mixed posts on social media, so the emotions in such text are expressed through a mixture of different natural languages. Code-Mixing (CM) is the natural phenomenon of embedding linguistic units such as phrases, words, or morphemes of one language into an utterance of another. The large amount of code-mixed data on social media has drawn researchers to study code-mixed text. While some work has been done on code-mixed social media text, and on emotion prediction and irony detection separately, our work is the first attempt to identify the emotion and detect the irony associated with Hindi-English code-mixed social media text. In this thesis, we analyze the problem of emotion identification and irony detection in code-mixed content.
We present a corpus of Hindi-English code-mixed tweets for Emotion Prediction and Irony Detection. The corpus for Emotion Prediction is annotated with the associated emotion and with the causal language of the expressed emotion. The corpus for Irony Detection is annotated with the labels ironic or non-ironic. For every tweet in the datasets, we annotate the source language of each word. Finally, we propose a supervised classification system that uses various machine learning techniques to predict the emotion and detect the irony associated with the text, based on a variety of character-level, word-level, and lexicon-based features. We evaluate our systems on the presented datasets using 10-fold cross-validation.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.