IIIT Hyderabad Publications
Humorous and Professional – A Dive Into Social Media Text Classification

Author: TANISHQ CHAUDHARY 2019114007
Date: 2024-05-07
Report no: IIIT/TH/2024/61
Advisor: Radhika Mamidi

Abstract

Social media is now a deep-rooted part of our daily lives. Whether for entertainment or information acquisition, it serves as a communication hub for billions of users. In this thesis, we study text classification through a two-fold approach that contrasts text from professional and humorous domains. First, we examine the nuances of human communication on a previously unexplored social media platform, Blind. Next, we look at humor to identify how those nuances are exploited. Our aim is to analyze these contrasting worlds thoroughly and demonstrate that they rest on the same underlying structures and goals, giving a comprehensive view of the social media landscape.

In the non-humorous domain, Blind has emerged as an anonymous platform with the unique goal of satisfying the growing need for taboo workplace discourse. Employees come to the platform to discuss issues such as layoffs, compensation, interview advice, and career progression. In our work, we explore the platform in detail for the first time by scraping and analyzing two datasets containing seven years of industry data: 767,224 Blind Posts and 63,477 Blind Company Reviews. Using the Blind Posts dataset, we dissect the popular discussion topics of employees, find mappings to global events such as work-from-home, return-to-office, and layoffs, and aggregate the sentiments of the platform for a comprehensive temporal analysis. We then propose our novel content classification pipeline, built on the Blind Posts: we first filter relevant content with an accuracy of 99.25% and then annotate the relevant text into ten categories with an accuracy of 78.41%.
Using the Blind Company Reviews, we conduct content and metrical analyses for a complete view of the platform, and we complete our novel content classification pipeline by adding the ability to mine employee opinions, with an accuracy of 98.29%.

For the humor domain, we use the Short Jokes dataset, which contains 231,657 text jokes from the r/jokes and r/cleanjokes subreddits on Reddit. On this data, we use linguistically motivated features inspired by the Incongruity theory of humor and the General Theory of Verbal Humor (GTVH). These features let us consider humor instruments from the phonetic level to the pragmatic level, such as alliteration chain lengths, text polarity, and slang. We train multiple machine learning and transformer models, which achieve accuracies of 63% and 98.90%, respectively. To better understand the gap between these results, we analyze the style and the semantics of the text in detail.

Finally, we formalize the results across tasks and explain the consistently superior results of transformers, gaining valuable insights into the common underlying structure of text classification tasks.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.