IIIT Hyderabad Publications
Better Understanding of Code-Mixed Social Media Data via Information Extraction

Author: Sumukh S
Date: 2023-10-14
Report no: IIIT/TH/2023/163
Advisor: Manish Shrivastava

Abstract

Code-switching, or code-mixing, occurs when "lexical items and/or grammatical features from two languages appear in one sentence". With the rising popularity of social media platforms such as Twitter, Facebook, and Reddit, the volume of text on these platforms has grown significantly. Twitter alone has over 500 million text posts (tweets) per day. India, a country with over 300 million multilingual speakers, had over 23 million Twitter users as of January 2022, and code-switching is heavily used on the platform. Code-mixed social media posts present unique challenges because they mix languages, slang, and informal expressions. Extracting information from code-mixed data enables a better understanding of user sentiments, preferences, and trends across different language communities. It helps identify named entities, detect events, and extract relevant information from multilingual conversations. By effectively extracting and analysing code-mixed content, businesses, researchers, and language processing systems can gain insights into diverse language patterns, linguistic phenomena, and cultural nuances, contributing to more accurate language processing, sentiment analysis, and targeted communication strategies. These benefits span many domains and have led to growing interest in Information Extraction (IE) as an active research area in artificial intelligence. However, research efforts are often hindered by the lack of automated NLP tools for analysing massive amounts of code-mixed data, especially for resource-poor languages.

There have been significant efforts towards understanding languages such as English, German, French, and Spanish through the creation of annotated resources for various tasks, while other languages have not received the same attention and remain resource-poor. One such language is Kannada, which has over 58 million native and second-language (L2) speakers worldwide. As in most places around the world, Kannada speakers tend to mix their language with English in informal settings, including social media and texting platforms, but this code-mixed usage has not received much attention from researchers. In this thesis, we work towards information extraction from Kannada-English code-mixed social media data, focusing on two problems: Named Entity Recognition (NER) and Event Detection. Both tasks are crucial for understanding and analysing user-generated content in diverse languages. They enable the identification of named entities, such as names of people, organizations, and locations, as well as the detection of events, allowing for deeper insights into multilingual conversations, cross-cultural trends, and effective information retrieval from code-mixed social media posts. We have collected Kannada-English (Kanglish) code-mixed data from Twitter and curated annotated datasets for both the NER and Event Detection tasks. We have analysed the challenges that are unique to Kannada-English code-mixed data and have provided the annotation guidelines used for these datasets. We have also proposed a few supervised approaches to these tasks on our dataset, with careful feature selection, and critically analysed the results in the hope of encouraging more attention from the research community to such low-resource languages.

Full thesis: pdf
Centre: Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.