Generating category-specific entity embeddings for populating Knowledge Graphs

Author: Gokul Thota 2019111009
Date: 2024-05-23
Report no: IIIT/TH/2024/59
Advisor: Vasudeva Varma

Abstract

Knowledge graphs (KGs), structures that represent information about entities/topics and their interconnections, play a crucial role in leveraging information on the web for downstream tasks such as text generation and classification. Hence, it is vital to construct and maintain such knowledge graphs. Previous efforts have addressed populating KGs and generating relevant entity/node embeddings for this task. However, these methods typically do not analyze entity-specific content exclusively; instead, they rely on transformational techniques applied to a fixed collection of documents containing certain entities. We define an approach to populate such KGs by utilizing entity-specific content on the web to generate category-specific entity embeddings. We empirically demonstrate the effectiveness of our approach by applying it to the downstream task of notability detection on one of the most popular and important knowledge graphs, Wikipedia. Wikipedia is an essential platform because of its informative, dynamic, and easily accessible nature. New content is uploaded to it at a very high rate, which makes moderating this content essential. To ensure that only important and relevant content is uploaded to Wikipedia, its editors define a specific set of "Notability" guidelines. These guidelines indicate whether a particular title warrants its own Wikipedia article. So far, notability has been enforced by humans, which makes scalability an issue, and there has been no significant work on automating notability detection across diverse categories. We address this problem by creating an automated system that detects the notability of different types of articles/pages across a vast set of categories.
Defining a fixed procedure for determining the notability of Wikipedia pages is not trivial, because pages differ in how they relate to the categories in which they appear. Within a given Wikipedia category, articles/pages that are simple category instances co-exist with pages that are associated with the category in a non-trivial manner. It is essential to distinguish Wikipedia pages based on this fundamental difference and to gauge the notability of each page accordingly, as the parameters relevant to the notability test vary in each case. We divide this problem into two components based on the nature of an article’s title. We define two types of article titles: Simple titles and Complex titles. Simple titles correspond to simple category instances/named entities for a given category. For example, "Virat Kohli" is a simple title in the category "Cricket Players". Complex titles correspond to article titles that have complex dependencies with their category. For example, an article titled "Wake Island" might be present in the "Birds" category because of its association with the category, not because it represents an instance of a bird. This distinction helps us define the categories to analyze for generating category-specific embeddings. Articles with simple titles are further divided into two classes, the "Abstract" class and the "Generic" class, based on whether or not they represent abstract concepts (such as Temperature or Pressure), as the notability detection process differs in each case. We construct a dataset with notable and non-notable samples for 9 categories belonging to the Generic class and 5 categories belonging to the Abstract class.
For articles with complex titles, on the other hand, another dataset is constructed for the 9 categories of the Generic class, as defining complex titles for conceptual entities in the Abstract class is non-trivial. We further design a generalizable mechanism to differentiate between simple and complex titles. We first design a notability detection system for articles with simple titles. This system is based on web-based entity features and their text-based salience encodings. We further incorporate neural networks and BERT encodings (transformer encoder) to perform binary classification. To validate our system’s performance on this task, we use accuracy metrics, correlation analysis, an ablation study, and prediction confidence on popular Wikipedia pages. Our system outperforms machine learning-based classifier approaches and existing handcrafted entity salience detection algorithms. Further, we define a system to detect notability specifically for articles with complex titles. This system is based primarily on web-based features and the salience of a title in the text of its web-based documents. We train a graph neural network (GNN) that generates attention-enhanced encodings for classification, with syntactic and semantic document graphs as inputs. We evaluate this system in the same way as the system for simple titles and observe that it outperforms existing ML-based classifiers, naive transformer-based classifiers, and handcrafted entity salience methods. Overall, we define two multipronged systems that generate category-specific embeddings for notability detection of the two types of article titles, Simple and Complex, that exist on the Wikipedia KG. We construct a corresponding dataset for each type of article title and evaluate each system on its respective dataset.
These systems provide an efficient and scalable alternative to manual decision-making about the importance of a particular topic, irrespective of its category or nature. Given the empirical evidence of the systems’ effectiveness, we conclude that the approach used to define them can be extended to any KG structure to generate category-specific embeddings.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
