IIIT Hyderabad Publications
MULTILINGUAL IDENTIFICATION AND MITIGATION OF BIAS AND CLICKBAITY CONTENT

Author: Anubhav Sharma
Date: 2023-06-05
Report no: IIIT/TH/2023/62
Advisor: Radhika Mamidi

Abstract

Handling fine-grained subtleties in text remains a challenge, despite advances in the understanding and generation of textual content. The difficulty in identifying and segregating such subtleties arises from the limits of human comprehension of these aspects and from their implicit presence in the data used to train NLP models for understanding and generation tasks. In this thesis, we primarily focus on two types of such content: bias and clickbait. Textual bias, which is implicitly manifested across corpora, arises from individual inclinations, perspectives and interpretations of facts, and distorts the understanding of information in the discourse by imposing subjective opinions. The presence of such affective content is a nuisance for encyclopedic platforms like Wikipedia, which aim to provide knowledge from a neutral point of view and are used as a reliable source worldwide. Clickbait is another form of malicious content that distracts user attention on social media websites. It works by luring a user into clicking on linked articles that often contain trivial to no useful information compared to the intensity suggested by the “bait”, resulting in unnecessary user frustration and potentially masking helpful information by diverting attention that could be paid to websites with objective information.

While bias identification and mitigation are well-studied problems in NLP for English, and bias has been defined in multiple ways depending on the corpus and objective, we fix our definition from the perspective of the neutral point of view on Wikipedia. We also study the bias problem at a multilingual scale, treating native languages of the South Asian subcontinent as low-resource languages studied in conjunction with English. Before taking up these multilingual problems in the Wikipedia domain, we first study three other problems in this setting. In the first, we propose a novel architecture for fine-grained categorization of entities on a Wikipedia-based dataset covering 30 languages. We then study two dual problems in a cross-lingual setting: in one, we aim at content enrichment in low-resource languages using factual information from structured knowledge bases in English, and describe the creation of a novel dataset for this study; in the other, we propose two methods for extracting English facts from unstructured content in low-resource languages, so that the knowledge contained in those languages can be adequately utilized. The study of these three problems lays a solid foundation for studying more subtle problems like bias at a multilingual scale. As part of our work on bias identification and mitigation, we present several attempts at creating a sizeable multilingual, parallel dataset using edit tags on Wikipedia, and we curate and propose a dataset for this study based on translation from existing datasets in English.
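The translation-based dataset creation admits a small illustrative sketch. The translation model, the target language (Hindi) and the CSV layout below are assumptions made only for illustration; the thesis does not commit to this specific setup.

    # Hypothetical sketch: projecting an English bias-labelled dataset into a
    # low-resource language via machine translation while keeping the labels.
    import csv
    from transformers import pipeline

    # Assumed off-the-shelf English->Hindi model; any MT system could be substituted.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

    with open("english_bias_dataset.csv", newline="", encoding="utf-8") as src, \
         open("hindi_bias_dataset.csv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)  # assumed columns: text, label (0 = neutral, 1 = biased)
        writer = csv.DictWriter(dst, fieldnames=["text", "label"])
        writer.writeheader()
        for row in reader:
            translated = translator(row["text"], max_length=256)[0]["translation_text"]
            writer.writerow({"text": translated, "label": row["label"]})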
Following dataset creation, we model bias detection as a classification problem and bias mitigation as a style transfer problem, through extensive experiments carried out in monolingual and multilingual settings. Since existing methods for evaluating textual debiasing are insufficient for this purpose, we design an evaluation strategy that combines traditional generation-based metrics with two additional metrics: the percentage of change and the bias classification accuracy on the generated output. Following an analysis of our results, we also present several directions for extending this problem.

We approach the problem of mitigating the effects induced by clickbait through the generation of “spoilers”, short pieces of content that satisfy the curiosity generated by the clickbait. We attack the problem as a two-stage pipeline, where the first stage predicts the type of spoiler to be generated and the second stage uses a spoiler type-specific generator to extract the necessary content from the article. We propose a novel Information Condensation-based modeling approach, in which we filter the article associated with the clickbait to weed out much of the potentially unnecessary information; the article with condensed information is then used for the two-stage problem. Our experiments reveal the merit of a contrastive learning-based method for designing the filtering model, as opposed to simpler classification-based methods. We achieve SoTA results on the problem and present an extensive analysis of the techniques used.
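The two added debiasing metrics lend themselves to a short sketch. In the snippet below, the bias classifier is a stand-in for whatever trained detector scores the generated output; the names and exact formulation are illustrative rather than the thesis's definitions.

    # Sketch of the two added metrics: percentage of change and bias
    # classification accuracy (share of outputs judged neutral).
    from typing import Callable, List

    def percent_changed(sources: List[str], outputs: List[str]) -> float:
        """Share of generated outputs that differ from their biased inputs."""
        changed = sum(s.strip() != o.strip() for s, o in zip(sources, outputs))
        return 100.0 * changed / len(sources)

    def neutrality_accuracy(outputs: List[str],
                            bias_classifier: Callable[[str], int]) -> float:
        """Share of outputs the bias classifier labels as neutral (label 0)."""
        neutral = sum(bias_classifier(o) == 0 for o in outputs)
        return 100.0 * neutral / len(outputs)

The information-condensation step can likewise be sketched as relevance filtering of article sentences against the clickbait post. The thesis trains a contrastive filtering model; the off-the-shelf sentence encoder and the top-k cutoff below are assumptions used only to illustrate the idea.

    # Illustrative condensation step: keep only the article sentences most
    # relevant to the clickbait post before the two-stage spoiler pipeline.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

    def condense(post: str, article_sentences: list, top_k: int = 5) -> list:
        """Return the top-k article sentences most similar to the clickbait post."""
        post_emb = encoder.encode(post, convert_to_tensor=True)
        sent_embs = encoder.encode(article_sentences, convert_to_tensor=True)
        scores = util.cos_sim(post_emb, sent_embs)[0]
        top = scores.topk(min(top_k, len(article_sentences))).indices.tolist()
        return [article_sentences[i] for i in sorted(top)]  # keep article order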
Full thesis: pdf
Centre for Language Technologies Research Centre