Shallow Parsing Pipeline For Code-Mixed Social Media Text

Author: Arnav Sharma
Date: 2020-06-23
Report no: IIIT/TH/2020/56
Advisor: Dipti Misra Sharma

Abstract

Natural Language Processing (NLP) is a challenging field at the intersection of Artificial Intelligence, Computer Science, and Computational Linguistics, concerned with the interaction between natural languages and computers. Understanding syntax is a vital aspect of natural language processing. With the advent of social media platforms and their popularity in the Hindi heartland, a vast amount of text is generated on such platforms. Traditional NLP tools trained on corpora written in standard grammar are ill-suited for such text. The reasons are its unconventional grammatical structure, non-standard spellings, abbreviations, and a phenomenon called code-mixing. Code-mixing refers to the embedding of linguistic units such as phrases, words, or morphemes of one language into an utterance of another language. It is commonly observed in the day-to-day language of multilingual speakers and in their utterances on social media platforms. Hence, there is a need to develop tools specifically for Hindi-English code-mixed social media text. This study addresses the problem of shallow parsing, often a prerequisite to full parsing, of Hindi-English code-mixed social media text. To create the input data for a shallow parser, we have also built a language identifier, a normalizer, and a part-of-speech tagger. Along with the tools, we have built a multi-level annotated corpus that is available to the research community, free from the restrictions posed by content sourced from social media platforms.
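As a rough illustration of how the stages named in the abstract could feed one another, the following minimal sketch chains a language identifier, a normalizer, a part-of-speech tagger, and a chunker over a single code-mixed sentence. It is not the thesis implementation: the lexicons, tag names, chunk labels, function names, and the example sentence are all hypothetical, and each stage is a plain dictionary lookup where a real system would use a trained model.

```python
from itertools import groupby

# Toy resources: illustrative assumptions only, not data from the thesis.
HINDI_LEXICON = {"bht", "bahut", "h", "hai"}            # romanized Hindi forms
NORMALIZATION = {"bht": "bahut", "h": "hai", "gud": "good"}
POS_LEXICON = {"movie": "NOUN", "bahut": "ADV", "good": "ADJ", "hai": "VERB"}
CHUNK_OF_POS = {"NOUN": "NP", "ADV": "ADJP", "ADJ": "ADJP", "VERB": "VP"}


def identify_language(token):
    """Label a token 'hi' if it is in the toy Hindi lexicon, else 'en'."""
    return "hi" if token.lower() in HINDI_LEXICON else "en"


def normalize(token):
    """Map a non-standard spelling to a standard form when one is listed."""
    return NORMALIZATION.get(token.lower(), token.lower())


def pos_tag(word):
    """Look up a coarse part-of-speech tag; unknown words get 'X'."""
    return POS_LEXICON.get(word, "X")


def shallow_parse(text):
    """Run all stages and group adjacent words that share a chunk label."""
    rows = []
    for token in text.split():
        word = normalize(token)
        rows.append((token, identify_language(token), word, pos_tag(word)))
    chunks = []
    for label, group in groupby(rows, key=lambda r: CHUNK_OF_POS.get(r[3], "O")):
        chunks.append((label, [r[2] for r in group]))
    return rows, chunks


rows, chunks = shallow_parse("movie bht gud h")
for row in rows:
    print(row)      # (raw token, language, normalized form, POS tag)
print(chunks)       # list of (chunk label, normalized words)
```

Each row keeps the raw token alongside its language tag, normalized form, and part-of-speech tag, which loosely mirrors the idea of a multi-level annotation; the chunker then only needs the last of these layers.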

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
