“Answer ka type kya he?” Learning to Classify Questions in Code-Mixed Language

Authors: Chandu Khyathi Raghavi, Manoj Chinnakotla, Manish Shrivastava
Conference: International World Wide Web Conferences Steering Committee Republic and Canton of Geneva, Switzerland ©2015 (http://www.www2015.it/documents/proceedings/companion/p853.pdf 2015)

Date: 2015-05-18
Report no: IIIT/TR/2015/53

Abstract

Code-Mixing (CM) is the embedding of linguistic units such as phrases, words, and morphemes of one language into an utterance of another language. CM is a natural phenomenon observed in many multilingual societies. It speeds up communication and allows a wider variety of expression, which has made it a popular mode of communication on social media platforms such as Facebook and Twitter. However, current Question Answering (QA) research and systems only support questions expressed in a single language, which is an unrealistic and difficult constraint, especially in domains such as health and technology. In this paper, we take the first step towards a full-fledged QA system for CM language: building a Question Classification (QC) system. The QC system analyzes the user question and infers the expected Answer Type (AType). The AType helps in locating and verifying the answer because it imposes type-specific constraints. We learn a basic Support Vector Machine (SVM) based QC system for English-Hindi CM questions. Because of the inherent complexity of processing CM language and the unavailability of language processing resources such as POS taggers, chunkers, and parsers, we design our current system using only word-level resources such as language identification, transliteration, and lexical translation. To reduce data sparsity and leverage resources available in a resource-rich language, instead of extracting features directly from the original CM words, we translate them into a common language, English, and then perform featurization. We created an evaluation dataset for this task, and our system achieves accuracies of 63% and 45% on the coarse-grained and fine-grained categories of the question taxonomy, respectively. Translating features into English improves accuracy over the unigram baseline.
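As a rough illustration of the word-level pipeline the abstract describes (identify each word's language, map Hindi words into English, then extract unigram features), the following minimal sketch may help. It is not the authors' implementation. The lexicon `HINDI_TO_ENGLISH` and the functions `identify_language`, `to_english`, and `featurize` are hypothetical names introduced here, and a toy dictionary lookup stands in for real language identification, transliteration, and lexical translation. The SVM training step is omitted.

```python
from collections import Counter

# Hypothetical toy lexicon: Romanized Hindi word -> English gloss.
# Assumption: a real system would use a trained language identifier,
# a transliterator, and a bilingual dictionary instead of this table.
HINDI_TO_ENGLISH = {
    "ka": "of",
    "kya": "what",
    "he": "is",
    "hai": "is",
    "kaun": "who",
    "kahan": "where",
}


def identify_language(word):
    """Toy word-level language ID: 'hi' if the word is in the lexicon, else 'en'.

    Note that this lookup cannot resolve ambiguous tokens such as "he",
    which is also an English word; it is always treated as Hindi here.
    """
    return "hi" if word in HINDI_TO_ENGLISH else "en"


def to_english(word):
    """Map a Hindi word to its English gloss; leave English words unchanged."""
    if identify_language(word) == "hi":
        return HINDI_TO_ENGLISH[word]
    return word


def featurize(question):
    """Return unigram counts over the English-normalized tokens of a question."""
    tokens = question.lower().replace("?", " ").split()
    return Counter(to_english(token) for token in tokens)


features = featurize("Answer ka type kya he?")
print(features)
```

The resulting counts could then be passed to any standard classifier, such as an SVM, in place of counts over the original code-mixed words. Mapping both languages into one vocabulary is what reduces sparsity: "he" and "hai" collapse into the same feature.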

Full paper: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
