Wikipedia Article Generation for Named Entities

Author: Yashaswi P
Date: 2018-06-09
Report no: IIIT/TH/2018/24
Advisor: Kamalakar Karlapalem

Abstract

As the largest online encyclopedia, Wikipedia provides a large number of human-edited articles that are linked to named entities. Growing Wikipedia's content requires determining whether such entities warrant their own article and then writing these articles by hand, which is a laborious task. The limited number of active contributors and the considerable time spent collecting information and editing make this manual process slow and difficult. This thesis presents an automated solution for both determining whether a named entity warrants its own Wikipedia article and then creating one using a semi-supervised approach. Wikipedia must carefully examine which articles to include in order to maintain its indispensability. To prevent unnecessary articles from being included, Wikipedia's official guidelines require that named entities meet “notability” standards before an article is included. We investigated automating notability determination for named entities using reliability and entity salience features. Evaluations of our notability determination system provide evidence that our solution can replace the manual decisions reviewers make when applying the notability rules to article inclusion. We also developed a semi-supervised method for automatically generating Wikipedia articles as an alternative to manual creation. We created a framework that generates a Wikipedia article for a named entity; the generated article not only looks similar to other Wikipedia articles in its category but also aggregates information from the Web about diverse aspects of that named entity. In particular, a semi-supervised method is used to determine the headings and to identify the content for each heading in the generated article.
Additional important contributions include reducing the manual effort of collecting relevant training data by extracting it from the Wikipedia dump, and increasing the reliability of the content by introducing parameters that model the reliability of a site based on existing Wikipedia articles in the same category. Our experiments showed the viability of our solution as an alternative to previous approaches that used supervised or unsupervised methods. Automating both notability determination and article creation in this way can help pave the way toward a future in which consistent growth of Wikipedia is automatic and requires very little human intervention.


Centre for Data Engineering

IIIT Hyderabad Publications
