Towards Developing a Lexical Ontology Resource and Augmenting Novel Approaches for Sentiment Analysis Task through Enrichment of Available Resources in Telugu

Author: Sreekavitha Parupalli
Date: 2018-10-19
Report no: IIIT/TH/2018/89
Advisor: Radhika Mamidi

Abstract

The major contribution of this thesis is the creation and enrichment of a lexical resource for Telugu, a resource-poor language. The thesis also describes the enrichment of OntoSenseNet, a verb-centric lexical resource. Our work aims to preserve and enhance the usage of an authentic Telugu dictionary by developing a computational version of it. This enables native speakers of the language to take an active part in the research, since most computer science experts and algorithms work on top of annotated language data. With the aid of the developed Telugu dictionary, native speakers can produce better annotations because both the word and its meaning are in a language they are familiar with. Hence, we developed this Telugu lexical resource, and language experts manually carried out numerous annotations. We attempted two types of annotation: 1) ontological classification of verbs, adverbs and adjectives; 2) annotation of unigrams and bigrams for sentiment polarity. Based on the proposed ontological classification, the manually annotated gold-standard corpus consists of 8483 unique verbs and 253 unique adverbs. Annotations are done by native speakers according to a set of provided annotation guidelines. We discuss the annotation procedure in detail and validate the developed resource using inter-annotator agreement as a measure. Additional words extracted from Telugu WordNet are combined with our resource and annotated as well. Furthermore, we discuss the enrichment of this manually developed Telugu lexicon, OntoSenseNet. OntoSenseNet is an ontological-sense annotated lexicon that classifies verbs into 7 sense-types and adverbs into 4 sense-classes. The developed OntoSenseNet for Telugu has primary and secondary sense-types identified for verbs and a primary sense-class tag for adverbs. This area of research is relatively recent but has large scope for development.
We present introductory work on enriching OntoSenseNet to promote further research in Telugu. Classifiers are trained to learn the sense-types of the words in the resource, which lets us automate the tagging of sense-types for verbs. We perform a comparative analysis of different classifiers applied to OntoSenseNet. The results of the experiment show that automated enrichment of the resource is effective using SVM classifiers and an AdaBoost ensemble. However, the accuracy is low compared with manual annotation. To carry out manual annotation more extensively, we have developed a tool for crowd-sourcing the annotation task. Access to contribute to the resource is given only to certified individuals after a preliminary assessment. The tool provides the annotation guidelines and a list of words to be annotated, and the annotator is free to choose an “uncertain” option when a judgment is unclear. The mechanisms adopted to minimize disagreements, and the measures taken when adding these annotations to our resource, are discussed in detail in this thesis. Additionally, we discuss the potential applications of this ontological resource. Moreover, we have worked to enhance the sentiment analysis task through phrase-level annotations. By annotating bigrams in Telugu, we developed a systematically annotated corpus that can support the enhancement of sentiment analysis tasks in Telugu. These are the second type of annotation mentioned above. The developed polarity-annotated corpus is called ‘BCSAT’. From the developed Telugu dictionary, we extracted 11,000 adjectives, 253 adverbs and 8483 verbs. We also extracted words from SentiWordNet and bigrams from the target corpus. Sentiment polarity annotations for these extracted lexemes are done by language experts. We discuss the methodology followed for the polarity annotations and provide validation for the developed resource.
The fundamental aim is to study and validate the possibility of using phrase-level sentiment annotations for automated sentiment identification. This work aims to develop a benchmark corpus, as an extension to SentiWordNet, and a baseline accuracy for a model in which phrase-level sentiment annotations are applied to sentiment prediction. The method we present outperforms all known methods when tested on the recognized, standard benchmarks for the sentiment analysis task in Telugu.
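
Inter-annotator agreement, which the abstract names as the validation measure, is often reported as Cohen's kappa when two annotators label the same items. The abstract does not say which statistic the thesis uses, so the Python sketch below is only an illustration under that assumption; the function name `cohen_kappa` and the sample polarity labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: this is one common agreement statistic; the thesis may
    report a different one.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal probabilities of using that label, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical polarity labels for four bigrams from two annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A kappa of 1 means perfect agreement and a value near 0 means agreement no better than chance, which is why it is usually preferred over raw percentage agreement when the label distribution is skewed.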

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
