Named Entity Extraction and Knowledge Base Enhancement

Author: Priya Radhakrishnan
Date: 2018-04-03
Report no: IIIT/TH/2018/9
Advisors: Vasudeva Varma, Manish Gupta

Abstract

The past decade witnessed explosive growth in the amount of unstructured data, especially in the public domain, mainly due to Web 2.0 and social media. This has created a need for applications that extract structured information from such noisy data. Automatic extraction of structured information from unstructured data is called information extraction. The structured information thus extracted is stored in a knowledge base, which stores facts about entities such as name, type and other attributes. On the one hand, the information extraction task uses the entity facts stored in the knowledge base to refine the extraction process; on the other, the extracted facts further refine the knowledge base. Thus the knowledge base provides structure and guidance to the extraction task and is in turn enhanced by its results. The tasks of entity extraction and knowledge base enhancement are therefore mutually dependent and mutually beneficial. Hence this thesis proposes methods to enhance both tasks, in an effort to build a strong and sound named entity extraction system. Documents typically talk about multiple named entities. Not all named entities mentioned in a document are equally important to its content. Named entities that are important to the document are called salient named entities. In this thesis we propose a method to identify the salient named entities of a document. The importance of a named entity can also be judged by understanding how it is semantically related to other named entities mentioned in the document. We propose a method to identify the presence of such semantic relations within named entities, for example relations such as attribute or category in a product title. Understanding the salience of a named entity and its semantic relations helps the Named Entity Extraction task extract the important information from text while filtering out unimportant information.
The performance of Named Entity Extraction methods depends on the size and structure of the context of the named entity mention in the text. While a larger and better-structured context improves Named Entity Extraction performance, a smaller and poorly structured context reduces it. In this thesis we propose three different Named Entity Extraction approaches tailored to the varying size and structure of the context. The proposed methods perform on par with state-of-the-art methods with improved latency. Named Entity Extraction approaches that work with less context and poorer structure increasingly depend on non-textual signals such as the global coherence of entities in the knowledge base. Conventional entity linking (EL) performs well for popular entities but poorly for less popular (a.k.a. tail) entities, because conventional EL methods depend on the richness of the entity neighborhood in the Knowledge Graph (KG). In this context ‘KB’ refers to a Knowledge Base like Wikipedia and ‘KG’ refers to a Knowledge Graph like the Wikipedia Hyperlink Graph. Tail entities have sparse entity neighborhoods, and hence EL methods perform poorly on them. In this thesis we propose ELDEN, an EL system that overcomes the degree sparsity problem of tail entities. ELDEN enriches an entity’s neighborhood in a KG by extracting high-quality mentions of the entity from a web corpus using the Pointwise Mutual Information (PMI) measure. ELDEN outperforms state-of-the-art EL systems, with a significant improvement in linking tail entities, and achieves the best results on the CoNLL and TAC datasets. We follow the discussion of ELDEN with a discussion of an information retrieval task that is improved by the use of KB entities. We propose a novel method for improving the classification of research papers into ACM computer science categories using KB entities from both Wikipedia and Freebase.
Throughout this thesis we present five methods of improving Named Entity Extraction using knowledge bases. We conclude the thesis by looking at how knowledge bases can be improved with Named Entity Extraction. We review and analyze the main approaches to New Entity Identification (NEI) in Named Entity Extraction systems. We analyze the features, share insights from reproducing state-of-the-art results, and suggest future improvements.
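
The abstract names Pointwise Mutual Information (PMI) as the measure used to select entity mentions but does not give its formulation. As an illustration only, the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))) can be sketched over document-level co-occurrence counts. The function name pmi_scores, the toy corpus, and the choice of documents as the co-occurrence window are assumptions made for this sketch; they are not taken from the thesis and do not represent ELDEN's actual pipeline.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Return PMI (base 2) for each pair of entities that co-occur in a document.

    `documents` is a list of entity lists, one list per document.
    Probabilities are estimated as document frequencies (an assumption
    of this sketch, not a detail stated in the abstract).
    """
    n_docs = len(documents)
    entity_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        # Count each entity at most once per document; sort so that
        # every pair has a single canonical (a, b) ordering.
        entities = sorted(set(doc))
        entity_counts.update(entities)
        pair_counts.update(combinations(entities, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n_docs
        p_a = entity_counts[a] / n_docs
        p_b = entity_counts[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy corpus: "x" and "y" always appear together.
corpus = [["x", "y"], ["x", "y"], ["z"], ["w"]]
scores = pmi_scores(corpus)
```

In this toy corpus, "x" and "y" each occur in half of the documents and always together, so their PMI is positive: they co-occur more often than independence would predict. A system that enriches a sparse entity neighborhood could, under these assumptions, keep only pairs whose score exceeds a chosen threshold.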

Full thesis: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
