Retrieving Semantically Similar Questions in Community Question Answering

Author: arpita.das
Date: 2017-07-14
Report no: IIIT/TH/2017/70
Advisor: Manish Shrivastava, Manoj Chinnakotla

Abstract

Internet users today prefer getting precise answers to their questions rather than sifting through a list of relevant documents returned by search engines. This has led to the huge popularity of Community Question Answering (cQA) services such as Yahoo! Answers, Baidu Zhidao, Quora and StackOverflow, where forum users respond directly to questions with short, targeted answers. These forums provide a platform for interaction with experts and serve as a popular and effective means of information seeking on the Web. Anyone can obtain answers by posting questions for other participants on these sites, and the community can also judge the quality of the answers to a question. Over time, such cQA archives have become rich repositories of knowledge encoded in the form of questions and user-generated answers.

However, not all questions get immediate answers from other users. If a question is not interesting enough for the community, or if a similar question has already been answered by another user, it may suffer from “starvation”. Such questions may take hours and sometimes days to get satisfactory answers. This delay can be avoided by searching for similar questions in the very large archives of previously asked questions. If a similar question is found, the corresponding best answer can be provided without any delay. The main challenge in retrieving similar questions is the “lexico-syntactic gap” between the user query and the questions already present in the forum. The aim is to detect question pairs that differ from each other lexically and syntactically but express the same meaning.

In this thesis, we propose two novel approaches to bridge the lexico-syntactic gap between the question posed by the user and forum questions. In the first approach, we design the “Deep Structured Topic Model (DSTM)”, which retrieves similar questions that lie in the vicinity of the latent topic vector space of the query and the archived question-answer pairs. The retrieved topically similar questions are then reranked using a deep semantic model. In the second approach, we explore the behaviour of deep semantic models with “parameter-sharing” between the parallel networks, which helps us design the “Siamese Convolutional Neural Network for cQA (SCQA)”. It consists of twin convolutional neural networks with shared parameters and a contrastive loss function joining them. It learns the similarity metric for question-question pairs by leveraging the question-answer pairs available in cQA forum archives. The model projects semantically similar question pairs nearer to each other and dissimilar question pairs farther away from each other in the semantic space.

Several models have been built in the past to bridge the lexico-syntactic gap in cQA content. However, given the ever-growing nature of the data in cQA forums, these models cannot be left static. They need to be continuously updated so that they can adapt to the changing patterns of questions and answers over time. Such update procedures are expensive and time-consuming. In this thesis, we propose a novel topic-model-based active sampler named “Picky”. It intelligently selects a smaller subset of the newly added question-answer pairs to be fed to the existing model for updating it.

Experiments on the large-scale real-life “Yahoo! Answers” dataset reveal that DSTM and SCQA outperform current state-of-the-art approaches based on translation models, topic models and existing deep neural network based models. Also, evaluations on real-life cQA datasets show that “Picky” converges at a faster rate, giving performance comparable to that of other baseline sampling strategies updated with data of ten times the size.
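
As an illustration of the contrastive objective mentioned for SCQA above, the following is a minimal sketch only. The abstract does not state the exact form of the loss or the distance used in the thesis, so this assumes the common margin-based formulation over Euclidean distance between the two network outputs. The function names, the 0.5 scaling and the default margin are illustrative assumptions, not taken from the thesis.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, is_similar, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Assumed formulation (not confirmed by the abstract):
      similar pair:    0.5 * d^2                  (pulls the pair together)
      dissimilar pair: 0.5 * max(0, margin - d)^2 (pushes the pair apart
                                                   until d >= margin)
    where d is the Euclidean distance between u and v.
    """
    d = euclidean_distance(u, v)
    if is_similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Under this formulation, a similar pair is penalised in proportion to its squared distance, while a dissimilar pair contributes nothing once it is at least `margin` apart, which matches the described behaviour of projecting similar questions nearer and dissimilar questions farther in the semantic space.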

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
