IIIT Hyderabad Publications
Towards Extracting and Utilising Entities in Task Specific Low Resource Settings

Author: Swayatta Daw
Date: 2023-05-15
Report no: IIIT/TH/2023/53
Advisor: Vikram Pudi

Abstract

The task of entity extraction has been thoroughly explored in the NLP community over the last two decades. A myriad of downstream Natural Language Understanding tasks utilize entity extraction, including Information Retrieval, Question Answering, Fact Extraction and Verification, and Knowledge Graph Completion. However, the datasets and domains used to evaluate and benchmark entity extraction models have mostly been straightforward, structurally simple, semantically non-complex, and well represented across the training data. Most entity extraction tasks involve a significant overlap in entities between the train and test sets. This does not mimic the real-world scenario, where rare emergent entities show a larger presence, and the entities themselves are often complex and semantically ambiguous.

Our work investigates entity extraction in two settings: the domain of scientific research papers, and linguistically low-resource, structurally complex settings. In the first setting, we establish an end-to-end pipeline that extracts certain task-specific entities from research documents, which help in meaningfully mapping the research landscape. We show that the domain of scientific documents is non-trivial for entity extraction tasks because scientific entities follow a long-tail distribution. We incorporate multiple strategies, such as Distant Supervision, Graph Ranking, and Sequence Labelling, to solve this task. Furthermore, we introduce multiple human-annotated gold standard datasets for both modular and complete evaluation of the entire pipeline.

In the second setting, we investigate the extraction of low-resource, semantically ambiguous, complex entities. We experiment with multiple transformer-based architectures and pre-training strategies to solve this task. Our approach significantly outperforms the baseline as well as multiple ensemble and gazetteer-based systems. We hope that our work will help further research in this critical area of Natural Language Processing.

Full thesis: pdf

Centre for Data Engineering