Sentence Classification for Morphologically Rich Languages

Author: Madhuri Tummalapalli
Date: 2018-11-26
Report no: IIIT/TH/2018/88
Advisor: Radhika Mamidi

Abstract

Sentence classification is one of the most fundamental tasks in Natural Language Processing (NLP): the aim is to assign a given sentence to one of a fixed set of classes. The set of classes depends on the task under consideration, such as determining the type of answer in Question Classification, finding the sentiment of a sentence in Sentiment Analysis, or deciding whether a sentence is subjective or objective in Subjectivity Analysis. Other examples include sarcasm or humour detection and the detection of abusive text. In sentence classification, the aim is to have a common network architecture for a range of classification tasks, rather than building an independent one for each task; in other words, to build a task-independent classification system. Many methods have been developed for various sentence classification tasks in English, but they usually exploit linguistic resources such as parsers, which makes them difficult to adapt to other languages. A huge number of languages are spoken around the world, and it is not feasible to build an independent classifier for each task in every language. Thus, in this thesis, we explore language-independent, task-independent sentence classification, focusing on improving classification for morphologically rich languages. We perform sentence classification on five datasets covering two tasks (Sentiment Analysis and Question Classification) and three languages (English, Hindi and Telugu). We present an evaluation of popular deep learning methods for sentence classification on the morphologically rich Indian languages Hindi and Telugu. For this purpose, we also created a question classification dataset for Hindi by translating the TREC-UIUC English dataset.
We show that character-based input can enhance the performance of current classification systems for morphologically rich languages. We further show that our proposed multiInput-CNN variant performs better than our baselines in two out of three tasks in Hindi and Telugu, while giving comparable results for the others. Since a large proportion of the work in sentence classification is specifically for English, the methods are often designed to best suit that language. Most existing work represents an input sentence as a sequence of words; only a few approaches rely on character-level representations. Through this work, we introduce a new method for representing a sentence: as a sequence of syllables. It is essential to capture sub-word level information when classifying in a morphologically rich language. Syllables are an ideal choice for this, as in many languages most morphemes can be seen as n-grams of syllables, which makes them a more intuitive choice than character n-grams for capturing morphological information. Syllables can also reduce the noise introduced by character n-grams. Through extensive evaluation, we show that syllables are the best performing input type, compared with words or characters, for the morphologically rich languages Hindi and Telugu. Finally, we use an attention mechanism to automatically find the best combination of inputs for every dataset and language. We experiment with a number of different attention-based CNN models in this part and find that attention can be used to obtain a model that performs reasonably well across all datasets.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
