Towards Learning Language Agnostic Features for NLP in Low-resource Languages

Author: Allen Antony
Date: 2021-01-29
Report no: IIIT/TH/2021/1
Advisor: Radhika Mamidi

Abstract

In recent years, Natural Language Processing (NLP) has gained widespread attention in many commercial and academic applications. Recent advances in Machine Learning and Deep Learning, aided by the rise of large annotated datasets, are the cornerstone of many state-of-the-art NLP systems and architectures. The one caveat is that these advances depend on the availability of large annotated datasets for a particular NLP task. Since the advent of the Internet and the Digital Age, more and more information has been stored digitally. Given the global nature of the current information sharing infrastructure, most of the data generated belongs to one of three languages: English, Mandarin or Spanish. This abundance of raw data aids and motivates the creation of annotated NLP resources in these languages. On the other hand, the paucity of annotated data in most other languages makes it challenging to develop Deep Learning or Machine Learning based solutions for them. Hence there is a pressing need to develop novel solutions capable of performing NLP tasks in a low-resource setting. In this thesis, we attempt to tackle this data scarcity problem by introducing a novel approach for language invariant NLP which is capable of leveraging multiple monolingual datasets for training without any form of cross-lingual supervision. The proposed approach attempts to learn language agnostic features via adversarial training on multiple resource-rich languages, which can then be leveraged for inference on a low-resource language. The robustness of the proposed approach was tested on two well-defined NLP tasks:

1. Sentiment Analysis: For classifying the sentiment of a given document in a low-resource language, we introduce the Language Invariant Sentiment Analyzer (LISA) architecture, which learns language invariant sentiment features. LISA outperforms the previous state-of-the-art methods on the Multilingual Amazon Review Text Classification dataset and achieves significant performance gains over prior work on the low-resource Sentiraama corpus.
2. Open Domain Event Detection: For Open Domain Event Detection, which is a sequence labeling task, we introduce the Multi-Lingual Sequence Tagger (M-LiST) architecture, which attained state-of-the-art performance in three languages of the TempEval2 corpus.

A detailed analysis of our research highlights the ability of our architectures to achieve state-of-the-art performance in the presence of minimal amounts of training data for low-resource languages.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
