IIIT Hyderabad Publications
Analysis and Generation of Code-mixed Data

Author: Dama Sravani
Date: 2023-07-04
Report no: IIIT/TH/2023/117
Advisor: Radhika Mamidi

Abstract

Code-mixing (CM) is a common linguistic phenomenon, especially in multilingual communities, where speakers blend two or more languages or dialects in a single sentence or conversation. With English as the second language in India, code-mixing has become a predominant practice, particularly in urban areas. It is common for a Hindi or Telugu speaker to switch to English within the same sentence or utterance. The surge in code-mixing can be attributed to the rise of social media, which has provided a platform for people from diverse linguistic backgrounds to communicate. In such contexts, code-mixing has become an effective way to express oneself using words and phrases from multiple languages.

Code-mixing has long been of interest to sociolinguists. Alongside its linguistic structures, its functional role has become a significant subject of sociolinguistic study. In a multilingual country like India, politicians use language and linguistic mechanisms to establish relationships with people. Our study focuses on the use of code-mixing in Telugu political speeches, aiming to understand the factors responsible for its level of use in various social settings and communicative contexts. As part of our analysis, we have compiled a set of rules that capture dialectal variations between the Standard and Telangana dialects of Telugu.

Building NLP systems for code-mixed text can help analyze sentiment and conversations on social media. It can also facilitate machine translation, helping people who converse in multiple languages to communicate more efficiently. The informal nature of code-mixed text presents a significant challenge for NLP systems: it produces a wide range of language variation, including non-standard abbreviations, contracted spellings, and informal grammatical structures.
These challenges collectively result in a scarcity of quality code-mixed data that can be used for building NLP systems. We propose a hybrid methodology to generate code-mixed text. It combines rule-based and statistical approaches to convert a monolingual Telugu sentence into a code-mixed sentence in English and Telugu dialects. Because this hybrid approach is complex, we also propose a fine-tuned neural machine translation method that generates high-quality code-mixed sentences using a minimal gold-standard corpus. We use filters derived from the gold corpus to ensure that the synthetic training data for the models is of high quality, which improves the performance of the neural machine translation models. Considering the recent success of pre-trained models such as mT5 and mBART, we fine-tuned these models. Our approach outperforms current systems trained on synthetic data for Hindi-English code-mixed generation. Beyond Hindi-English, the approach also performs well when applied to Telugu, a low-resource language, to generate Telugu-English code-mixed sentences.

It is crucial to investigate the effectiveness of filtering techniques for generating high-quality code-mixed data, especially for low-resource languages. Exploring one-shot and zero-shot learning techniques, to determine whether the models learn to generate code-mixed sentences in general or only for the languages they are trained on, is also a promising future direction.

Full thesis: pdf

Centre for Language Technologies Research Centre