Towards Understanding Code-Mixed Social Media Text

Author: Sakshi Gupta
Date: 2016-09-14
Report no: IIIT/TH/2016/55
Advisor: Radhika Mamidi

Abstract

Web 2.0, through platforms such as blogs, social networks, microblogs, and forums, allows users to freely write content on the Internet in order to provide, share and use information. These types of platforms are among the most visited websites, and interest in them continues to grow. Given the important role that social media usage increasingly plays in daily life, a growing body of literature has emerged in the research community that aims to mine social media content, or to evaluate the linguistic aspects of that content in order to better understand its dynamics (Golder et al., 2011) [22]. This user-generated content has a major drawback: the informal nature of the language used. Non-standard abbreviations, contracted spelling variations, and casual grammatical structure are just some of the aspects of social media language. Over the past few decades, sociolinguists have been interested in a phenomenon called "code-mixing", which has been observed in social media data. Code-mixing refers to the embedding of linguistic units such as phrases, words or morphemes of one language into an utterance of another language. It is frequently seen in user-generated content on social media such as Facebook and Twitter, especially from multilingual users. Apart from the inherent linguistic complexity, the analysis of code-mixed content poses complex challenges owing to the presence of spelling variations, transliteration and non-adherence to a formal grammar. Because such data is present all across social media, there is a need to understand it. Any downstream Natural Language Processing task requires tools that are able to process and analyse code-mixed data. The first steps towards understanding this data are language identification and word normalisation, which together recover the standard form of a noisy sentence from social media.
In this thesis, we have developed a system for language identification and word normalisation for Hindi-English code-mixed social media text (CMST). We have provided annotation guidelines for our system, after analysing the complex nature of the dataset used. Using this system, we have released a dataset of 1446 code-mixed Hindi-English sentences along with the associated language and normalisation labels. To the best of our knowledge, our work is the first attempt at creating a publicly available annotated linguistic resource for this language pair. We have also performed experiments with shallow parsing, in an attempt to build a complete pipeline from raw data to shallow parsed data. Our pipeline consists of four modules: Language Identification, Normalisation, POS Tagging, and Shallow Parsing. As far as we understand, we are the first to attempt shallow parsing on code-mixed social media text. This system has been released online.
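
As a rough illustration of how the four modules named in the abstract might be chained, the following is a minimal Python sketch. It is not the thesis implementation: the Token fields, the function names, the toy lexicons, the tag sets, and the choice to normalise Hindi words to Devanagari are all assumptions made for this example, and each stage is a simple dictionary lookup standing in for whatever trained model the actual system uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Token:
    text: str        # raw token as typed by the user
    lang: str = ""   # language label, filled by stage 1
    norm: str = ""   # normalised form, filled by stage 2
    pos: str = ""    # part-of-speech tag, filled by stage 3
    chunk: str = ""  # B-/I- chunk label, filled by stage 4


# Hypothetical toy lexicons; a real system would learn these from annotated data.
HINDI_WORDS: Dict[str, str] = {"yeh": "यह", "bahut": "बहुत", "hai": "है"}
ENGLISH_VARIANTS: Dict[str, str] = {"mvie": "movie", "gr8": "great"}
POS_LEXICON: Dict[str, str] = {
    "यह": "DET", "movie": "NOUN", "बहुत": "ADV", "great": "ADJ", "है": "VERB",
}
CHUNK_TYPE: Dict[str, str] = {
    "DET": "NP", "ADJ": "NP", "NOUN": "NP", "VERB": "VP", "ADV": "ADVP",
}


def identify_language(tokens: List[Token]) -> List[Token]:
    # Stage 1: label each token "hi" or "en" (assumed label set).
    for t in tokens:
        t.lang = "hi" if t.text.lower() in HINDI_WORDS else "en"
    return tokens


def normalise(tokens: List[Token]) -> List[Token]:
    # Stage 2: map each token to a standard form using its language label.
    for t in tokens:
        key = t.text.lower()
        table = HINDI_WORDS if t.lang == "hi" else ENGLISH_VARIANTS
        t.norm = table.get(key, key)
    return tokens


def tag_pos(tokens: List[Token]) -> List[Token]:
    # Stage 3: look up a POS tag for the normalised form ("X" if unknown).
    for t in tokens:
        t.pos = POS_LEXICON.get(t.norm, "X")
    return tokens


def shallow_parse(tokens: List[Token]) -> List[Token]:
    # Stage 4: a token continues the current chunk only if it maps to the
    # same chunk type as the previous token; otherwise it starts a new one.
    previous = None
    for t in tokens:
        kind = CHUNK_TYPE.get(t.pos, "X")
        t.chunk = ("I-" if kind == previous else "B-") + kind
        previous = kind
    return tokens


STAGES: List[Callable[[List[Token]], List[Token]]] = [
    identify_language, normalise, tag_pos, shallow_parse,
]


def run_pipeline(sentence: str) -> List[Token]:
    # Whitespace tokenisation is a simplification for this sketch.
    tokens = [Token(text=w) for w in sentence.split()]
    for stage in STAGES:
        tokens = stage(tokens)
    return tokens


if __name__ == "__main__":
    for tok in run_pipeline("yeh mvie bahut gr8 hai"):
        print(tok)
```

The point of the sketch is the ordering: each stage reads only the fields written by the stages before it, which is why language identification and normalisation have to come first.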

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
