Information Extraction and Aggression Detection on Bilingual Social Media Texts

Author: Vinay Kumar Singh
Date: 2019-06-28
Report no: IIIT/TH/2019/70
Advisor: Manish Shrivastava

Abstract

With the evolution of social media, the data available online is more diverse than ever. The popularity of social media and micro-blogging websites such as Twitter and Facebook has given users the freedom to post content on the Internet with few restrictions, so that they can express themselves in real time to a worldwide audience. This adds a considerable amount of data to an already enormous Internet and motivates research on mining and extracting useful information from social media texts. With data at this scale, we need methods such as Named Entity Recognition that can answer specific questions without crawling the entire Internet. Beyond that, we also try to extract properties such as aggression, sarcasm, and gender from the data available on these micro-blogging sites. These sites also let users write in their native languages as well as in code-mixed text, which draws the research community's attention to these low-resource languages and texts. In this thesis, we analyze the problems of Named Entity Recognition (NER) and Aggression Detection in code-mixed social media texts and provide baseline systems for these two problems, which are not well explored. We present a corpus of Hindi-English code-mixed tweets for Named Entity Recognition, annotated following the BIO standard (Begin, Intermediate and Other). For Aggression Detection, we use the Hindi-English code-mixed dataset provided for the shared task at the 1st Workshop on Trolling, Aggression and Cyberbullying (TRAC-1).
Finally, we propose a supervised classification system that uses various machine learning techniques and a variety of features to predict the Named Entity tags of words in the NER task and to detect aggression in the text.
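
As an illustration of the BIO annotation scheme mentioned in the abstract, the following is a minimal sketch of how entity spans over a tokenized code-mixed tweet could be converted into per-token tags. The example sentence, the label names (PER, LOC), and the helper `to_bio` are hypothetical and are not taken from the thesis corpus or its tag set.

```python
def to_bio(tokens, spans):
    """Convert entity spans into BIO tags, one tag per token.

    spans: list of (start, end_exclusive, label) over token indices.
    The first token of an entity gets "B-<label>", the remaining
    tokens of that entity get "I-<label>", and all other tokens get "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical Hindi-English code-mixed sentence (not from the corpus).
tokens = ["Kal", "Narendra", "Modi", "Delhi", "mein", "speech", "denge"]
# Assumed annotations: tokens 1-2 form a person name, token 3 is a location.
spans = [(1, 3, "PER"), (3, 4, "LOC")]

tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A supervised tagger of the kind described above would then be trained to predict one such tag per token from features of the word and its context.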

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
