IIIT Hyderabad Publications
Grounded Content Automation: Generation and Verification of Wikipedia in Low-Resource languages.
Author: Shivansh Subramanian
Date: 2024-06-07
Report no: IIIT/TH/2024/91
Advisor: Vasudeva Varma, Manish Gupta

Abstract
In this thesis, we work towards improving the representation of low-resource languages in the digital world by easing these communities' access to, and participation in, reliable information hubs like Wikipedia. Although the internet has brought in an information age, information is disproportionately distributed amongst language communities, since content and tools for low-resource languages are less readily available. Recognizing the importance of Wikipedia as the primary source of reliable, unbiased information, we seek to improve both the quality and quantity of articles available by automatically generating Wikipedia articles in low-resource languages. Our work begins with XWikiGen, a cross-lingual multi-document summarization task that aims to generate Wikipedia articles using reference texts and article outlines. To facilitate this, we propose the XWikiRef dataset, which spans eight languages and five distinct domains, laying the groundwork for our experimentation. We observe that existing Wikipedia text generation tools rely on Wikipedia outlines to provide a structure for the article. Hence, we also propose Multilingual Outlinegen, a task focused on generating Wikipedia article outlines with minimal input in low-resource languages. To support this task, we introduce another novel dataset, WikiOutlines, which encompasses ten languages over eight domains, further enriching the multilingual resources available for future research. An important question in text generation is the reliability of the generated information. For this, we propose the task of Cross-lingual Fact Verification (FactVer). In this task, we aim to verify the facts in the source articles against their references, addressing the growing concern over hallucinations in language models. We manually annotate the FactVer dataset for this task and benchmark our results against it. By exploring these three tasks, we highlight the disparity in content and tools available in low-resource languages, underscore the importance of multilingual and cross-lingual tools in global participation, and propose innovative solutions to enhance Wikipedia’s accessibility and reliability for low-resource languages. Overall, we contribute multiple novel datasets and methodologies to automatic text generation and highlight the importance of inclusivity in the Internet age. By tackling the challenges of article generation, outline generation, and fact verification, we pave the way for future advancements that promise to improve the quality and quantity of information available to low-resource language communities of the world.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.