IIIT Hyderabad Publications
Text Classification for Telugu: Datasets, Embeddings and Models for Downstream NLP Tasks

Author: Mounika Marreddy
Date: 2023-05-25
Report no: IIIT/TH/2023/57
Advisor: Radhika Mamidi

Abstract

Language understanding has become crucial for the text classification tasks that underlie many Natural Language Processing (NLP) applications. Over the past decade, machine learning and deep learning algorithms have evolved alongside increasingly efficient feature representations, yielding better results, and NLP applications have become more powerful, domain-specific, and language-specific. For resource-rich languages like English, NLP applications achieve the desired results thanks to the availability of large corpora, diverse annotated datasets, efficient feature representations, and tools. Lacking large corpora and annotated datasets, many resource-poor Indian languages struggle to reap the benefits of deep feature representations. Moreover, adapting existing language models trained on large English corpora to Indian languages is often limited by data availability and by differences in morphology, syntax, and semantics. Most existing work on Indian languages takes a machine translation perspective; one option is to re-create English datasets in low-resource languages through translation, but for Indian languages like Telugu the meaning may change and crucial information may be lost, owing to structural differences, morphological complexity, and semantic divergence.

In this thesis, our main objective is to mitigate the low-resource problem for Telugu. To accelerate NLP research in Telugu, we present several contributions:

(1) A large raw Telugu corpus of 80,15,588 sentences (16,37,408 sentences from Telugu Wikipedia and 63,78,180 sentences crawled from different Telugu websites).
(2) Annotated Telugu datasets for sentiment analysis, emotion identification, hate speech detection, sarcasm identification, and clickbait detection.
(3) The first pre-trained distributed word and sentence embeddings for the Telugu corpus: Word2Vec-Te, GloVe-Te, FastText-Te, MetaEmbeddings-Te, and Skip-Thought-Te.
(4) Pre-trained contextual language models for Telugu (ELMo-Te, BERT-Te, RoBERTa-Te, ALBERT-Te, Electra-Te, and DistilBERT-Te), along with word and sentence embeddings from the graph-based models DeepWalk-Te and Node2Vec-Te and from Graph AutoEncoders (GAE).
(5) A multi-task learning model (MT-Text GCN) that reconstructs word-sentence graphs on the TEL-NLP data while performing multi-task text classification with the learned graph embeddings.

We show that our pre-trained embeddings are competitive with or better than the existing multilingual pre-trained models mBERT, XLM-R, and IndicBERT. Fine-tuning the pre-trained models also yields higher performance than linear probing on the five NLP tasks. We further evaluate our pre-trained models on other NLP tasks available in Telugu (named entity recognition, article genre classification, sentiment analysis, and summarization) and find that our Telugu pre-trained language models (BERT-Te and RoBERTa-Te) outperform the state-of-the-art systems on all tasks except sentiment analysis.
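As a concrete illustration of the fine-tuning versus linear-probing comparison above, the following sketch freezes or unfreezes the encoder of a BERT-style Telugu model before training a classification head. This is a minimal sketch, not the thesis code: the model id "your-org/bert-te" is a placeholder for the released BERT-Te checkpoint, and the single-example batch is only illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "your-org/bert-te"  # placeholder: substitute the released BERT-Te checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

LINEAR_PROBE = True  # True: train only the classifier head; False: fine-tune everything
if LINEAR_PROBE:
    for param in model.base_model.parameters():  # freeze the pre-trained encoder
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)

# One illustrative training step on a toy Telugu sentiment example
# ("this movie is very good", label 1 = positive).
model.train()
batch = tokenizer(["ఈ సినిమా చాలా బాగుంది"], return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

Under linear probing only the randomly initialized classification head is updated, so the comparison isolates how much task signal the frozen pre-trained representations already carry.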
We hope that the availability of the created resources for these NLP tasks will accelerate Telugu NLP research, which has the potential to impact more than 85 million people, and thereby bridge the resource gap for Telugu. The tasks can be extended to other Indian languages that are culturally and linguistically close to Telugu by translating these resources without losing information such as verb forms, cultural terms, and vibhaktis. This is the first work to apply neural methods to Telugu, a language that has lacked good tools such as named entity recognizers, parsers, and embeddings, and the first attempt to provide strong Telugu models by exploring different methods with the available resources. It can also help the Telugu NLP community evaluate advances over more diverse tasks and applications. We open-source our corpus, the five annotated datasets (SA, EI, HS, SAR, and clickbait), lexicons, pre-trained embeddings, and code here1. The pre-trained Transformer models for Telugu are available here2.
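To show how the released static embeddings might be consumed, here is a minimal sketch that loads Word2Vec-style vectors with gensim; the file name "word2vec_te.txt" and the text format are assumptions, since the actual release layout is behind the links above.

```python
from gensim.models import KeyedVectors

# Placeholder path: substitute the released Word2Vec-Te vector file.
vectors = KeyedVectors.load_word2vec_format("word2vec_te.txt", binary=False)

# Nearest neighbours of the word "తెలుగు" ("Telugu"), purely illustrative.
print(vectors.most_similar("తెలుగు", topn=5))
```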
Full thesis: pdf

Centre for Language Technologies Research Centre