Automatic Information Retrieval for Short Documents: Pubmed Articles and Bug Reports

Author: Jaspreet Singh
Date: 2022-05-26
Report no: IIIT/TH/2022/61
Advisor: Vikram Pudi

Abstract

As the volume of documents on the web keeps growing, tasks such as extracting relevant information for a query and retrieving similar documents have become major challenges. Information retrieval and classification systems are typically asked questions such as: Which documents match a given short text query? Which documents are similar to a given query document? Are two given query documents possible duplicates? To answer these questions, researchers have sought effective representations of documents and queries, and building frameworks for extracting such representations has been crucial to information retrieval research. Traditionally, handcrafted textual features have been used to represent documents. More recently, learning models such as transformers, which rely on the attention mechanism, have pushed the state of the art in language processing. These models have been used to build sentence and document representations; however, their application to extracting domain-specific inter-document information has been limited. Documents are found in various domains, such as medicine, scientific journals, and software engineering. Short documents are especially prevalent in empirical medicine, in the form of evidence-based systematic reviews, and in large-scale software repositories, in the form of bug reports. Data entry in both of these systems is primarily free text with weak structural guidance. The large and growing number of published studies makes identifying relevant studies in an unbiased way complex and challenging. Similarly, in bug repositories, the rapid growth of features and code in large-scale projects makes triaging and extracting potential duplicates time-consuming. This thesis therefore aims to provide document retrieval and representation methods by enhancing query formulation and generating effective document embeddings.
First, we devise a method to assist Cochrane experts in writing systematic reviews. To write a systematic review, researchers must conduct a thorough search of the published articles relevant to a study. We conduct experiments that test how effectively information retrieval methods overcome the limitations of Boolean search and title/abstract screening of documents. We provide a method that uses query expansion techniques based on relevance feedback to retrieve relevant documents. The feedback mechanism applies term boosting over multiple iterations and constructs grading feedback for ranking documents. Second, we focus on representing documents in large databases to improve retrieval performance while capturing the semantic concepts described in natural language. We present a novel method of representing a bug report by utilizing inter-document context. We use pre-trained transformers with a unique choice of triplet formulation to learn rich document embeddings. In experiments on downstream tasks such as duplicate bug detection and classification, the learned embeddings perform better than existing methods in terms of recall rates.
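
To make the idea of relevance feedback with term boosting concrete, the following is a minimal Python sketch. It is not the method developed in the thesis, whose details the abstract does not give. The scoring function, the stopword list, the parameters `top_k` and `boost`, and the toy documents are all assumptions chosen for illustration.

```python
from collections import Counter
import math

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "and", "in", "of", "on", "the"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def score(doc_tokens, query_weights):
    """Weighted term-frequency score, normalized by document length."""
    counts = Counter(doc_tokens)
    raw = sum(weight * counts[term] for term, weight in query_weights.items())
    return raw / math.sqrt(len(doc_tokens) or 1)


def rank(docs, query_weights):
    """Return document ids ordered from highest to lowest score."""
    return sorted(
        docs,
        key=lambda doc_id: score(tokenize(docs[doc_id]), query_weights),
        reverse=True,
    )


def boost_terms(query_weights, docs, relevant_ids, top_k=3, boost=0.5):
    """Add `boost` to the weight of the top_k most frequent terms found in
    the documents a reviewer marked as relevant (assumed feedback signal)."""
    counts = Counter()
    for doc_id in relevant_ids:
        counts.update(tokenize(docs[doc_id]))
    updated = dict(query_weights)
    for term, _ in counts.most_common(top_k):
        updated[term] = updated.get(term, 0.0) + boost
    return updated


# Toy collection, invented for illustration only.
docs = {
    "d1": "statin therapy reduces cholesterol in adults",
    "d2": "randomized trial of statin therapy and cardiovascular risk",
    "d3": "bug report crash on startup",
    "d4": "cardiovascular risk trial in older adults",
}

query = {"statin": 1.0}
initial_ranking = rank(docs, query)

# One feedback iteration: the reviewer marks d2 as relevant.
query = boost_terms(query, docs, relevant_ids=["d2"])
updated_ranking = rank(docs, query)
```

In this toy run, a document that never mentions the original query term (`d4`) moves ahead of an unrelated one (`d3`) after a single round of feedback, because it shares a boosted term with the document judged relevant. Repeating the `boost_terms` and `rank` steps gives the iterative behaviour described above.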

Full thesis: pdf

Centre for Data Engineering

IIIT Hyderabad Publications
