IIIT Hyderabad Publications
Neural Approaches for Code-Mixed NLP in Low-Resource Conditions

Author: Aditya Srivastava (201425112)
Date: 2022-05-27
Report no: IIIT/TH/2022/56
Advisor: Dipti Misra Sharma

Abstract

Code-mixing is a phenomenon of natural language in which multilingual speakers mix two or more of the languages they speak within the same utterance. This mixture is not made at random; it is governed by systematic rules and forms part of highly effective communication. Code-mixing is very common in exchanges between multilinguals in day-to-day conversation, mass media, pop culture and on social media networks. NLP systems have historically struggled to process and understand code-mixed speech because code-mixing is not standardized in informal contexts, such as casual speech and social media. This lack of standardization results in extremely high variation in how code-mixing is employed, making the design and creation of rule-based NLP systems impractical. The solution has been to turn to statistical machine learning systems, which can learn automatically from data. While a large trove of code-mixed data is available on the internet, it is extremely noisy and needs preprocessing and cleaning before it is viable as input to a machine learning pipeline. Because of the painstaking data normalization involved, resources for code-mixed language are scarce and often of poor quality.

In this thesis we detail our attempts at improving results on code-mixed NLP in scenarios where the amount of data is the limiting factor. First, we develop a neural-network approach to sentiment analysis of Hindi-English code-mixed text. We describe an architecture for the hierarchical analysis of code-mixed texts at the word and the sentence level, and use it for sentiment classification.
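The hierarchical idea described above can be sketched in miniature: a word-level stage builds each word's representation from its characters (robust to the non-standard spellings of code-mixed text), and a sentence-level stage composes word vectors into a sentence vector for classification. The sketch below is illustrative only, with untrained random weights, a hypothetical character vocabulary, and mean-pooling standing in for the thesis's actual (unspecified here) neural encoders.

```python
import math
import random

random.seed(0)

EMB_DIM = 8          # character embedding size (illustrative)
NUM_CLASSES = 3      # e.g. negative / neutral / positive

# Hypothetical character vocabulary; the real model's vocabulary and
# embeddings are learned from data, not fixed like this.
CHARS = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: [random.uniform(-1, 1) for _ in range(EMB_DIM)] for c in CHARS}

def encode_word(word):
    """Word-level stage: pool character embeddings into one word vector
    (a stand-in for a character-level neural encoder)."""
    vecs = [char_emb[c] for c in word.lower() if c in char_emb]
    if not vecs:
        return [0.0] * EMB_DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def encode_sentence(words):
    """Sentence-level stage: pool word vectors into a sentence vector."""
    vecs = [encode_word(w) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# Untrained linear classifier head over the sentence vector.
W = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(NUM_CLASSES)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def classify(tweet):
    h = encode_sentence(tweet.split())
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]
    return softmax(logits)

# A made-up Hindi-English code-mixed sentence ("this movie was very amazing").
probs = classify("yeh movie bahut amazing thi")
print(probs)
```

The two-stage decomposition is the point: by building word vectors from characters, the model can handle out-of-vocabulary romanized Hindi spellings that a fixed word vocabulary would miss.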
Second, we introduce a novel dataset intended for evaluating code-mixed language models, containing Hindi-English code-mixed tweets with parallel translations into both monolingual Hindi and monolingual English. We then establish baselines for the code-mixed translation task and describe how such datasets can be used to leverage the Hindi and English pretraining in multilingual models, via our hypothesis that multilingual models which learn to code-mix from both component languages will outperform those which learn from only a single component language. Finally, we put this hypothesis to the test and report our results.

Centre for Language Technologies Research Centre
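A three-way parallel corpus like the one described above supports the hypothesis test directly: each example pairs one code-mixed tweet with a Hindi and an English translation, so training pairs into code-mixed text can be built from one component language or from both. The sketch below uses invented example sentences (the real tweets live in the released corpus) and a hypothetical helper name.

```python
# Hypothetical examples in the spirit of the dataset described above;
# the actual tweets and translations are not reproduced here.
parallel_data = [
    {
        "code_mixed": "yeh movie bahut amazing thi",
        "hindi": "यह फिल्म बहुत शानदार थी",
        "english": "this movie was very amazing",
    },
    {
        "code_mixed": "kal office mein meeting hai",
        "hindi": "कल कार्यालय में बैठक है",
        "english": "there is a meeting in the office tomorrow",
    },
]

def make_training_pairs(examples, sources=("hindi", "english")):
    """Build (source, target) pairs for translating INTO code-mixed text.

    Passing both component languages as sources operationalizes the
    hypothesis that a multilingual model should learn code-mixing from
    Hindi and English together rather than from one language alone.
    """
    pairs = []
    for ex in examples:
        for src in sources:
            pairs.append((ex[src], ex["code_mixed"]))
    return pairs

both = make_training_pairs(parallel_data)                    # hi + en -> cm
hindi_only = make_training_pairs(parallel_data, ("hindi",))  # hi -> cm
print(len(both), len(hindi_only))
```

Comparing a model fine-tuned on `both` against one fine-tuned on `hindi_only` (or an English-only counterpart) is the shape of the experiment the hypothesis calls for, with everything else held fixed.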
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.