Simple and Effective Monolingual and Code-Mixed Question Answering

Author: Vishal Gupta
Date: 2019-07-25
Report no: IIIT/TH/2019/88
Advisor: Manish Shrivastava, Manoj Chinnakotla

Abstract

Homo sapiens have evolved over centuries owing to their curiosity. We have travelled millions of miles and spent countless years in the quest for answers. Fast forward to today: the invention of the internet has democratised access to enormous amounts of information at the click of a button. For the past two decades, search engines like Google have been the primary way of finding information online. However, search engines break down when a user presents complex information needs expressed as natural language questions. Additionally, as more people access the web from mobile devices on the move, the need for software that can interpret and answer questions becomes more important. More recently, voice assistants have aimed to provide natural and declarative access to information.
The proliferation of the internet among newer demographics has led to many new phenomena. One such phenomenon, called “code-mixing”, has been frequently observed in online social media and has attracted a lot of research interest from sociolinguists. Code-mixing (CM) is the mixing of linguistic units, such as phrases, words or morphemes, from multiple languages. It is prevalent in countries such as India, where people speak multiple languages with native proficiency. It speeds up communication and allows a wider variety of expression, which has made it an accessible mode of expression on the web. Hence, there is a need for software that allows these communities to communicate naturally and ask questions in code-mixed language.
In this thesis, we study the design of an effective Question Answering (QA) framework for QA over Knowledge Bases (KB). A given question may have multiple candidate answers, among which we have to pick the correct one. We adopt a two-step approach to answer questions: candidate (answer) generation followed by candidate (answer) re-ranking. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers.
We report experiments on the SimpleQuestions dataset, which uses Freebase as the underlying KB to answer questions. We demonstrate the effectiveness of our approach on the monolingual question answering task for English, beating previous benchmarks. We also study how existing QA resources for English and bilingual embeddings can be used effectively in resource-constrained settings such as monolingual Hindi and CM QA. We discuss the creation of our QA dataset for CM questions, study the effectiveness of our QA framework, and present an in-depth analysis of approaches that work for CM questions. We show that a system that works with CM questions in their original form fares better than translating them and using a QA system trained on monolingual English to obtain answers.
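The two-step approach described above (candidate generation followed by candidate re-ranking) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `KB`, the function names, and in particular the token-overlap scorer, which is only a stand-in for a learned re-ranker such as TSHCNN and is not the thesis's model.

```python
from typing import List, Optional, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical toy knowledge base; not taken from Freebase or the thesis.
KB: List[Fact] = [
    ("barack obama", "place_of_birth", "honolulu"),
    ("barack obama", "profession", "politician"),
    ("honolulu", "contained_by", "hawaii"),
]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(text.lower().replace("_", " ").replace("?", " ").split())


def generate_candidates(question: str) -> List[Fact]:
    """Step 1: keep every fact whose subject string occurs in the question."""
    q = question.lower()
    return [fact for fact in KB if fact[0] in q]


def score(question: str, fact: Fact) -> float:
    """Stand-in scorer: Jaccard overlap of question and relation tokens.

    A learned model would replace this function; the overlap heuristic
    is an assumption used only to keep the sketch self-contained.
    """
    q, r = tokens(question), tokens(fact[1])
    return len(q & r) / len(q | r)


def rerank(question: str, candidates: List[Fact]) -> List[Fact]:
    """Step 2: order the candidates from best to worst score."""
    return sorted(candidates, key=lambda f: score(question, f), reverse=True)


def answer(question: str) -> Optional[str]:
    """Run both steps and return the object of the top-ranked fact."""
    ranked = rerank(question, generate_candidates(question))
    return ranked[0][2] if ranked else None
```

Calling `answer("What is the place of birth of Barack Obama?")` first narrows the KB to the two facts about that subject and then prefers the one whose relation best matches the question. Separating the steps this way keeps the expensive scorer restricted to a small candidate set.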

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
