IIIT Hyderabad Publications
Towards Information Retrieval for Scholarly Document Processing

Author: Amit Pandey 2020702009
Date: 2024-05-24
Report no: IIIT/TH/2024/67
Advisor: Vikram Pudi

Abstract

The relentless growth of scholarly publications, with more than 5 million articles published annually, poses a formidable challenge for researchers who need efficient ways to review the literature. Systematic Literature Reviews (SLRs), crucial for understanding existing knowledge and identifying research gaps, are slowed by manual information extraction, which extends timelines and risks obsolescence. This thesis addresses the need for improved literature review methodologies by focusing on two tasks: Cited Text Span Retrieval (CTSR) and Named Entity Recognition (NER). CTSR identifies the text spans in a cited paper that a citation refers to, making it possible to trace where information originated, while NER identifies and categorizes entities within the text. We introduce CitRet, a hybrid model for CTSR that leverages semantic and syntactic characteristics of scientific documents and outperforms existing methods on the CL-SciSumm shared tasks. Fine-tuned on only 1040 documents, CitRet improves the F1 score by over 15%. We further explore complex NER for English, the non-trivial task of identifying rare and semantically ambiguous entities. Built on pre-trained language models, our models consistently outperform the baseline, with the best model improving the baseline F1 score by over 9%. Extending the scope to complex NER for low-resource languages, we leverage pre-trained language models for Chinese and Spanish. Using Whole Word Masking (WWM) to enhance the Masked Language Modeling objective, our models, which incorporate CRFs, BiLSTMs, and linear classifiers, outperform the baseline by a significant margin. The best-performing model attains a competitive position on the evaluation leaderboard for the blind test set.
This work aims to catalyze further research in the challenging domain of ambiguous, low-resource, complex NER. By addressing CTSR and NER, the thesis contributes to the broader goal of improving SLRs. Together, these tasks provide a structured approach to navigating the vast scientific publication landscape, easing the burden on researchers and promoting more efficient dissemination of knowledge.

Full thesis: pdf
Centre for Data Engineering
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.