Towards building Question Answering Resources for Hindi

Author: Kaveri Anuranjana
Date: 2021-01-09
Report no: IIIT/TH/2021/4
Advisor: Radhika Mamidi

Abstract

The internet has caused a revolutionary information explosion, and we have more resources available than ever before. The English-language Wikipedia articles alone amount to about 159.69 GB. As the resources we consume keep getting bigger and more complex, we need systems to query these datasets. Initially, large databases were queried with simple Information Retrieval techniques such as fetching keywords and combining the results with Boolean logic. Now we have search engines that can take queries in the form of multiple keywords and even questions. The retrieval side of these search engines also started with simple Document Retrieval methods, such as document frequencies and inverse document frequencies, and has evolved into complex ranking methods based on neural networks that can handle entire questions. Question Answering is a major field in NLP. It combines query formulation, document retrieval and, in some cases, summarization of the retrieved documents. We now have complete end-to-end systems that can fetch a suitable answer from a corpus. These systems can be trained on large datasets, whether closed-domain or open-domain. However, they rely heavily on curated datasets, which require a lot of manual labour and resources. For Indian languages, where reliable and substantial datasets are scarce, neural networks that rely purely on data cannot be used. Hence, providing more data and introducing data-independent techniques become important. We explore both of these aspects. First, we present a Reading Comprehension task in Hindi, HindiRC. The dataset is divided into grades based on reading proficiency, and we run the baseline experiment on each grade separately, which shows the increasing level of difficulty. Using various linguistic cues and metrics, we further show that the grades reflect linguistic complexity. In addition to addressing data scarcity, we propose a Hindi Question Generation methodology. The method is rule-based and relies on Computational Paninian Grammar Framework relations. It requires no additional resources and can be used to increase the number of questions in Hindi datasets. We also show that the generation method tends to overgenerate questions, further inflating the number. With the dataset and the question generation method, we aim to provide more resources for Question Answering in Hindi.


Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
