IIIT Hyderabad Publications
Automatic Generation of Hindi Wikipedia Pages

Author: Aditya Agarwal 20161104
Date: 2024-06-22
Report no: IIIT/TH/2024/113
Advisor: Radhika Mamidi

Abstract

Natural Language Generation (NLG) is a computer process that uses artificial intelligence to produce written or spoken language from structured or unstructured data [3]. Its purpose is to let computers communicate with users in understandable human language rather than in computer language. NLG focuses on creating coherent written content in human languages such as English, based on underlying data. Given the vast amount of text data available, NLG techniques are crucial for organizing and presenting information, with Wikipedia being a leading resource in this effort [15].

Wikipedia is a free online encyclopedia, available in multiple languages and written by volunteers known as Wikipedians. This collaborative platform uses a wiki-based editing system called MediaWiki. It is the largest and most-read reference work in history. Consistently ranked among the top 10 most popular websites by Similarweb, and previously by Alexa, Wikipedia is hosted by the Wikimedia Foundation, a non-profit organization based in the United States that relies on donations to operate.

Natural Language Generation in Wikipedia involves creating articles in various languages, either through WikiBot or through manual effort. The linking of language versions on Wikipedia has been improved by the introduction of Wikidata, a unified system that uses unique identifiers for entities and their attributes [8]. English Wikipedia gains about 500 articles a day, but Hindi Wikipedia has not seen comparable growth: it has only 150,000 pages, compared with English Wikipedia's 54 million articles. To enhance Natural Language Generation and maintain Wikipedia's multilingual character, creating more detailed Hindi pages is crucial.
This thesis proposes a method for automatically generating Hindi Wikipedia articles using Wikidata as a knowledge source [26]. The process extracts structured data from Wikidata, including entity names, properties, and relationships, and then generates natural language text from templates predefined for the subject area. We tested the method by generating articles about scientists and comparing them with machine-translated ones. The results show that over 70% of the articles produced by our method are superior in coherence, structure, and readability. This approach has the potential to significantly reduce the time and effort needed to create Hindi Wikipedia articles, and it can be extended to other languages and domains.

Centre for Language Technologies Research Centre
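The template-filling step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the templates, the function name `generate_article`, and the hand-written entity record are assumptions made for illustration. The property IDs (P569 date of birth, P19 place of birth, P106 occupation, P166 award received) are real Wikidata properties, but a real pipeline would fetch the values from Wikidata rather than hard-code them, and would need to handle Hindi gender and number agreement, which this sketch ignores.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical templates for the "scientist" domain: each pairs the Wikidata
# properties it needs with a Hindi sentence pattern. These are illustrative,
# not the templates used in the thesis. The masculine verb forms are
# hard-coded; a real system would select them by the entity's gender.
Template = Tuple[Tuple[str, ...], str]

TEMPLATES: Sequence[Template] = (
    (("P106",), "{label} एक {P106} थे।"),
    (("P569", "P19"), "{label} का जन्म {P569} को {P19} में हुआ था।"),
    (("P166",), "उन्हें {P166} से सम्मानित किया गया।"),
)


def generate_article(entity: Dict[str, str],
                     templates: Sequence[Template] = TEMPLATES) -> str:
    """Fill every template whose required properties are present."""
    if not entity.get("label"):
        raise ValueError("entity needs a 'label' to generate text")
    sentences = []
    for required, pattern in templates:
        # Skip a template when any property it needs is missing or empty.
        if all(entity.get(prop) for prop in required):
            sentences.append(pattern.format(**entity))
    return " ".join(sentences)


# Hand-written record standing in for data extracted from Wikidata.
entity = {
    "label": "सी. वी. रामन",
    "P569": "7 नवंबर 1888",
    "P19": "तिरुचिरापल्ली",
    "P106": "भौतिक विज्ञानी",
    "P166": "नोबेल पुरस्कार",
}

if __name__ == "__main__":
    print(generate_article(entity))
```

Keeping the required properties next to each pattern lets the generator drop a sentence when the data is missing instead of emitting a sentence with a blank slot, which matters for entities that are sparsely described.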