Semantic Textual Similarity for Hindi

Author: darshan.agarwal
Date: 2017-11-09
Report no: IIIT/TH/2017/78
Advisor: Radhika Mamidi

Abstract

Short texts play a major role in day-to-day communication, in forms such as emails, chat messages, tweets, news headlines and image captions. Many natural language processing techniques have been developed to process such text automatically at scale, and they are applied beyond core NLP in areas such as education, law, healthcare, social platforms and security. Identifying the similarity between short texts is an important research problem with applications in many of these areas. This thesis addresses the automatic identification of semantic similarity between two short texts.

One direct application of a semantic textual similarity (STS) system is in dialogue-based intelligent tutoring systems, where the user's answer is assessed by comparing it with the answer given by experts. Machine translation evaluation is another example, where the textual similarity between the system output and the gold-standard output is used to evaluate the accuracy of the system. STS also applies directly to answer sentence identification in question answering. In natural languages, similar ideas can be expressed in very different ways, which makes the task of semantic similarity challenging.

For European languages, semantic textual similarity has been studied extensively and various techniques have been proposed, but work on Indian languages has been very limited, and we observe a need for a semantic textual similarity system for these languages. We propose an annotation scheme for grading the semantic similarity between two sentences, and we propose and describe methods for automatically identifying semantic similarity between short texts in Hindi. We also study dialogues and propose techniques for automatically identifying subtopic boundaries in Hindi dialogue using semantic similarity as a measure. The proposed STS algorithms are applied to tasks such as machine translation evaluation and subtopic boundary identification.


Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
