IIIT Hyderabad Publications
Learning Representations for Text Classification of Indian Languages

Author: Nurendra Choudhary
Date: 2019-02-01
Report no: IIIT/TH/2019/8
Advisor: Manish Shrivastava

Abstract

Text classification is among the primary challenges of natural language processing, with a multitude of applications including, but not limited to, sentiment analysis, emoji prediction and topic modeling. Several approaches, ranging from rule-based systems to machine learning models, have been applied to text classification. Neural network models, especially, have shown promising results. However, these models are limited by their reliance on large amounts of annotated data. Machine learning models primarily operate on numerical data that represents the features of their input. In the case of text classification, the models therefore require the extraction of text features relevant to the given problem. This requirement led to the advent of a more specific area of study, representation learning, which analyzes feature engineering. Current approaches to learning sentence representations have predominantly centered on the semantics of a sentence. However, semantic features are often inconsequential in text classification problems. Thus, to bolster task-specific representation learning, models need to learn text features according to the problem at hand, rather than adopting a pre-compiled representation model.

With the proliferation of the Internet and its penetration into multilingual societies, the linguistic diversity of the real world is now reflected in online communities. Limiting solutions to a set of major languages is no longer viable. Nevertheless, this proliferation is relatively recent, and hence the amount of available data in many widely spoken languages is inadequate. Manual annotation of data is an enormous and often expensive undertaking. Additionally, the inflectional and agglutinative properties of morphologically rich languages cause data sparsity problems. The possibility of leveraging resource-rich languages to enhance the performance of text classification in resource-poor languages is therefore an appealing one.

To this end, the thesis presents a twin Bidirectional Long Short-Term Memory (Bi-LSTM) network with shared parameters, consolidated by a contrastive loss function based on a similarity metric. Fundamentally, the aim is to classify text into multiple categories based on its features. The model jointly learns the text features (or representations) of resource-poor and resource-rich languages in a mutually shared space by utilizing the similarity between their assigned categories. The shared parameters of the Siamese network enable the projection of sentences into a common space, and the similarity metric ensures the correctness of the projection. Essentially, the model projects sentences with similar categories closer to each other and sentences with different categories farther from each other. This also enables leveraging resource-rich languages to enhance text classification of resource-poor languages, because the projection is based on language-agnostic annotation tags. So, pairing resource-poor and resource-rich sentences together as input to the pair of Bi-LSTMs results in their projection into a shared space. Hence, the resource-rich sentences aid the classification of resource-poor sentences by providing additional training samples.
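The architecture described above can be summarized in a short sketch. The following is a minimal illustration, assuming PyTorch; the class and function names, dimensions and margin value are hypothetical and not taken from the thesis. It shows the two defining ingredients: a single Bi-LSTM encoder shared between both inputs, and a contrastive loss that pulls same-category pairs together and pushes different-category pairs apart.

```python
# Minimal sketch of a twin Bi-LSTM (Siamese) encoder with a contrastive
# loss, assuming PyTorch. All names, dimensions and the margin value are
# illustrative assumptions, not the thesis's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Bi-LSTM is reused for both inputs: this weight sharing is
        # what makes the network a "twin" with shared parameters.
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)

    def encode(self, token_ids):
        # Project a sentence into the shared representation space.
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.encoder(embedded)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([hidden[0], hidden[1]], dim=-1)

    def forward(self, sent_a, sent_b):
        # Both sentences (e.g. one resource-rich, one resource-poor)
        # pass through the same encoder, landing in a common space.
        return self.encode(sent_a), self.encode(sent_b)

def contrastive_loss(repr_a, repr_b, same_category, margin=1.0):
    # same_category is 1.0 when the pair shares an annotation tag, 0.0
    # otherwise. Same-category pairs are pulled together; different-
    # category pairs are pushed at least `margin` apart.
    distance = F.pairwise_distance(repr_a, repr_b)
    positive = same_category * distance.pow(2)
    negative = (1.0 - same_category) * F.relu(margin - distance).pow(2)
    return (positive + negative).mean()
```

In training, each input pair would consist of a resource-rich and a resource-poor sentence, with same_category set by comparing their language-agnostic category tags; because the encoder weights are shared, gradients from the abundant resource-rich data shape the same representation space used for the resource-poor language.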
Experiments on the text classification tasks of Multilingual Sentiment Analysis and Multilingual Emoji Prediction, validated against large-scale standard datasets, reveal that the model significantly outperforms state-of-the-art approaches for both resource-poor and resource-rich languages. Furthermore, jointly training on resource-poor and resource-rich languages exhibits a significant performance improvement over training on the resource-poor language alone.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.