IIIT Hyderabad Publications
Techniques for Text Classification and Text Generation: Enhanced Online Sexism Detection and Template driven Wikipedia Article Generation

Author: Jayant Panwar 2019114013
Date: 2024-04-25
Report no: IIIT/TH/2024/36
Advisor: Radhika Mamidi

Abstract

Text classification and generation are linchpins in the explosive growth of the internet, seamlessly organizing and generating vast volumes of textual information. These pivotal techniques not only enhance Natural Language Processing but contribute indispensably to the dynamic evolution of online communication and knowledge dissemination in the modern world. This is why we have selected Sexism Detection and automated Wikipedia article generation in Indian languages as our topics for researching techniques in text classification and text generation, respectively. Detecting sexism in text promotes inclusivity and gender equality online, whereas generating Wikipedia articles in Indian languages enhances the accessibility of knowledge to diverse communities.

With the rapid growth of online communication on social media platforms, there has also been an increase in the amount of hate speech, especially in terms of sexist language online. The proliferation of such sexist language has an impact on the mental health and well-being of the users and, hence, underscores the need for automated systems to detect and flag such pieces of text online. In this thesis, we explore the effectiveness of both conventional machine-learning techniques and different modern techniques applied to BERT-based classifiers for detecting sexist text in the EDOS shared task. The results show that the performance of the baseline conventional classifiers and BERT-based classifiers is greatly improved when techniques like Data Resampling, Data Augmentation, Domain Adaptive Pre-training, and Learning Rate Scheduling are implemented.

The surge of data on the internet has also seen the rise of a knowledge giant, Wikipedia.
In today’s world, it is seen as one of the primary online sources for the distribution and consumption of free knowledge. Wikipedia users collaborate in a structured and organized manner to publish and update articles on numerous topics. English Wikipedia offers more information than any other language edition (more than 6.7 million articles). However, the amount of information available in Indian languages lags far behind. The same article in Hindi may be vastly different from its English version and generally contains less information. This poses a problem for native Indian language speakers who are not proficient in English. Therefore, having the same amount of information in Indian languages will help promote knowledge among such speakers. Publishing the articles manually, which has been the status quo for decades, is a time-consuming process. To bring the amount of information in native Indian languages up to speed with the amount available in English, automating the whole article generation process is the best option. In this thesis, we present a stage-wise approach that begins with Data Collection, proceeds through Summarization and Translation, and culminates in Template Creation. This approach ensures the efficient generation of a large amount of content in Hindi Wikipedia in less time.

All in all, this work aims to develop automated systems to foster a safe digital space for users of social media platforms and to propose a stage-wise approach for the efficient generation of Wiki articles, hence promoting the dissemination of knowledge for native Indian language speakers.

Full thesis: pdf
Centre for Language Technologies Research Centre