IIIT Hyderabad Publications
Towards Building a Shallow Parsing Pipeline for English-Telugu Code Mixed Social Media Data

Author: Kovida Nelakuditi
Date: 2017-12-13
Report no: IIIT/TH/2017/82
Advisor: Radhika Mamidi

Abstract

The Internet keeps growing, and the popularity of social media is growing even faster. A rapidly increasing number of users write in informal rather than formal language online. In a country like India, where most users are bilingual or multilingual, the content available on social media is also code-mixed. Code mixing refers to the embedding of linguistic units such as phrases, words and morphemes of one language into an utterance of another language [26]. Social media is one of the closest available mirrors of the real world: it shows how users react to current events and engage with each other. This data offers evidence, objectivity and hidden insight that can be used for various purposes. Given the many benefits of understanding social media, it is important to develop tools for processing such data. The research community worldwide has produced a large body of work on code-mixed social media data, but very little of it addresses Indian languages, and no work has been done on the English-Telugu pair so far. This thesis describes the work done towards building a shallow parsing pipeline for English-Telugu code-mixed social media data collected from Facebook. A corpus of 10,207 words is annotated with the language label, normalized form, POS tag and chunk-level information of each word. Shallow parsing can serve as a preprocessing step for many natural language processing applications, such as summarization of posts and replies on social media, multilingual machine translation on microblogs, and building user-friendly human-computer interaction systems. The shallow parsing pipeline consists of four modules: Language Identification, Normalization, POS Tagger and Chunker.
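To make the order of the four modules concrete, the following is a minimal sketch of how one annotated token and the pipeline stages could be represented. It is not the thesis implementation: the `Token` class, the stage functions, the tag strings (`en`, `te`, `UNK`, `O`) and the tiny word list are hypothetical placeholders, and the example sentence is made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One word with the four annotation layers named in the abstract."""
    surface: str                      # word as written in the post
    lang: Optional[str] = None        # language label
    normalized: Optional[str] = None  # normalized form
    pos: Optional[str] = None         # part-of-speech tag
    chunk: Optional[str] = None       # chunk-level label


# Hypothetical toy word list; a real system would use a trained classifier.
TOY_ENGLISH_WORDS = {"movie", "super", "the"}


def identify_language(tokens: List[Token]) -> List[Token]:
    # Placeholder: word-list lookup, everything else is assumed Telugu.
    for tok in tokens:
        tok.lang = "en" if tok.surface.lower() in TOY_ENGLISH_WORDS else "te"
    return tokens


def normalize(tokens: List[Token]) -> List[Token]:
    # Placeholder: lowercasing stands in for mapping to a standard form.
    for tok in tokens:
        tok.normalized = tok.surface.lower()
    return tokens


def tag_pos(tokens: List[Token]) -> List[Token]:
    # Placeholder: a trained tagger would assign real POS tags here.
    for tok in tokens:
        tok.pos = "UNK"
    return tokens


def chunk_tokens(tokens: List[Token]) -> List[Token]:
    # Placeholder: a trained chunker would assign chunk labels here.
    for tok in tokens:
        tok.chunk = "O"
    return tokens


# Stage order follows the abstract: each stage reads earlier annotations.
PIPELINE = [identify_language, normalize, tag_pos, chunk_tokens]


def run_pipeline(text: str) -> List[Token]:
    tokens = [Token(surface=word) for word in text.split()]
    for stage in PIPELINE:
        tokens = stage(tokens)
    return tokens


if __name__ == "__main__":
    for tok in run_pipeline("Movie chala bagundi"):
        print(tok)
```

Keeping every layer on the same token object lets later stages condition on earlier ones, for example a POS tagger that uses the language label as a feature.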
Language Identification is the process of identifying the language of every token in the text. It is the first step in any further processing of code-mixed text. Normalization is the process of converting the data into its standard form. Part-of-speech tagging is a primary and important step for many natural language processing applications. POS taggers and shallow parsers have reported high accuracies on grammatically correct monolingual data. POS tagging is treated as a classification problem, and we use classifiers such as linear SVMs, CRFs and Multinomial Bayes with different combinations of features that capture both the context of a word and its internal structure. We also report experiments on combining monolingual POS taggers to tag this code-mixed English-Telugu data. Chunking is likewise treated as a classification problem, and we use various combinations of features to train the learning model.

Full thesis: pdf
Centre for Language Technologies Research Centre