IIIT Hyderabad Publications
Utilising a Dataset for Two Text Classification Tasks: Sentiment Analysis and Author Identification

Author: Ojaswi Binnani
Date: 2023-01-21
Report no: IIIT/TH/2023/5
Advisor: Radhika Mamidi

Abstract

The Internet has many tools that connect people from different ends of the globe. Geographical boundaries do not hinder the connection between two people; communication is easily facilitated, and news is easily spread. However, this connectivity can also be exploited and used negatively. Bullying, hate speech, and fake news can all spread across the Internet, and a common denominator in combating these issues is Sentiment Analysis: the task of finding the sentiment (positivity or negativity) of a text. Sentiment can be an essential feature for bullying detection, hate speech detection, and fake news detection models. However, many libraries and functions are available, and choosing the most accurate one to use as a feature for other Natural Language Processing (NLP) tasks is essential. While surveying functions that calculate sentiment, otherwise known as pre-trained sentiment models, we find that most models are trained on polarised texts and on datasets that already have sentiment annotations. Hence, we evaluate the models on a non-polarised text form, news articles, using a dataset that was not designed for Sentiment Analysis. Through our experiments we find that VADER, Stanza, and the Transformers library's DistilBERT achieve good accuracy and are well suited as the basis for a sentiment feature, whereas lexicon-based approaches such as TextBlob and SentiWordNet are less accurate.

The dataset we use is the Reuter 50 50 dataset, which is intended for another NLP task: Author Identification/Author Profiling [60]. Working with this dataset led to our second task: to create a model that requires minimal data and computational power while still achieving accuracy comparable to deep learning methods for Author Identification. Author Identification is the task of determining, with reasonable accuracy, which of a group of candidate authors wrote a given text. This is useful for preventing people from taking credit for works they have not written and for attributing works to their rightful authors. We find that a mixture of word-based, stylometric, and syntax/punctuation features works well for training an Author Identification model. We also find that the Boosting class of algorithms outperforms other classes such as linear models, nearest-neighbour-based models, and tree models; XGBoost performs the best of all the algorithms we test. Linear models such as SVMs, along with Naive Bayes and the K-Nearest Neighbour algorithm, perform abysmally.
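To make the comparison of pre-trained sentiment tools concrete, here is a minimal sketch that scores one news-style sentence with three of the tools named in the abstract: VADER, TextBlob, and the Transformers sentiment pipeline (which loads a DistilBERT model fine-tuned on SST-2 by default). The example sentence and the choice of packages (vaderSentiment, textblob, transformers) are assumptions for illustration, not the thesis's experimental setup.

    # A minimal comparison sketch, assuming the vaderSentiment, textblob, and
    # transformers packages are installed; illustrative only.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from textblob import TextBlob
    from transformers import pipeline

    text = "The company reported a sharp drop in quarterly profits."

    # VADER: rule/lexicon-based; the compound score lies in [-1, 1].
    vader_compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]

    # TextBlob: lexicon-based; polarity also lies in [-1, 1].
    textblob_polarity = TextBlob(text).sentiment.polarity

    # Transformers: the sentiment-analysis pipeline defaults to a DistilBERT
    # model fine-tuned on SST-2 and returns a label with a confidence score.
    bert_result = pipeline("sentiment-analysis")(text)[0]

    print(f"VADER compound:    {vader_compound:+.3f}")
    print(f"TextBlob polarity: {textblob_polarity:+.3f}")
    print(f"DistilBERT:        {bert_result['label']} ({bert_result['score']:.3f})")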
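Similarly, the feature-based Author Identification approach can be sketched as a small classification pipeline. The three features below (average word length, type-token ratio, punctuation rate) are illustrative stand-ins for the word-based, stylometric, and syntax/punctuation feature families mentioned in the abstract, and the toy texts and author labels are invented for the example; none of this is the thesis's actual feature set or data.

    # A minimal sketch of feature-based author identification with XGBoost,
    # assuming numpy and xgboost are installed; features and data are invented.
    import string

    import numpy as np
    from xgboost import XGBClassifier

    def simple_features(text):
        words = text.split()
        n_words = max(len(words), 1)
        avg_word_len = sum(len(w) for w in words) / n_words            # word-based
        type_token_ratio = len({w.lower() for w in words}) / n_words   # stylometric
        punct_rate = sum(c in string.punctuation for c in text) / max(len(text), 1)  # punctuation
        return [avg_word_len, type_token_ratio, punct_rate]

    # Toy corpus: texts paired with integer author labels (author 0 and author 1).
    texts = [
        "Markets slid today; analysts, unsurprisingly, blamed the weather.",
        "Shares fell. Nobody knows why. They never do.",
        "The committee, after much deliberation, approved the merger proposal.",
        "Deal done. Merger approved. Next question.",
    ]
    labels = np.array([0, 1, 0, 1])

    X = np.array([simple_features(t) for t in texts])
    model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
    model.fit(X, labels)

    unseen = "Prices dropped. Again. No surprises there."
    print("Predicted author:", model.predict(np.array([simple_features(unseen)]))[0])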
Full thesis: pdf

Centre for Language Technologies Research Centre