Identification of Multiword Expressions in Hindi

Author: Shashikant Muktyar
Date: 2020-05-27
Report no: IIIT/TH/2020/39
Advisor: Soma Paul

Abstract

Identifying multiword expressions (MWEs) is an important task, particularly in a language such as Hindi, where they make up almost 40 percent of the language. Yet serious work on the problem began only in the last couple of decades. The task is challenging because MWEs do not follow any set of syntactic, semantic, or lexical rules that would distinguish them from other expressions. The vagueness in how MWEs are defined lies at the core of the problem. Researchers have defined them in several ways, one of them even going as far as calling them “a pain in the neck for NLP” for their idiosyncratic interpretations that cross word boundaries [Sag et al., 2002]. In this work, we develop ways to semi-automatically identify the MWEs occurring in Hindi. We focus on MWEs that can be defined by the syntactic type of their constituents. We also look at ways to identify collocations, which form a subset of MWEs. We make use of existing Hindi resources, MWE extraction tools, and word similarity tools. As a major part of this work, we propose a technique to identify Noun+Verb MWEs using the Hindi Dependency Treebank [HDTB] as the existing knowledge base. The criterion of word similarity is used to find an association between a test expression (a potential MWE) and an existing MWE. We hypothesize that similar words tend to be part of similar idiosyncratic constructions, and that such expressions generally occur in sentences with similar syntactic construction (karaka and vibhakti similarity). Thus there is a relatively higher possibility that these expressions are MWEs. We also prepare what we call a karaka chart from the HDTB knowledge base, which holds karaka frame details for the MWEs.
Each MWE has a karaka frame, which is a list of the potential karakas that the Noun+Verb expression takes to construct a valid and meaningful sentence. It is used to see how similar the karaka labels of the test expression are to those of its most similar MWE in the karaka chart. The similarity is calculated using a popular word similarity tool. This work is an effort to gain a deeper understanding of the correlation between word similarity and karaka frame similarity. Our work on collocations identifies adjective-noun bigrams using a technique we propose. The criterion of non-substitutability of a constituent within a collocation is a major parameter in the collocation identification algorithm. Two commonly used methods of automatic collocation extraction, the t-score test and the pointwise mutual information test, were also applied to the same data set.
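For readers unfamiliar with the two baseline measures named above, the following is a minimal sketch of how pointwise mutual information and the t-score are commonly computed for adjacent word pairs. It uses the standard textbook formulations, not the thesis's own implementation. The function name `bigram_scores` and the toy romanized tokens are hypothetical, and the sketch assumes that restricting candidates to adjective-noun pairs (for example with a POS tagger) happens in a separate step.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(w1, w2): (pmi, t_score)} for every adjacent word pair.

    Hypothetical helper for illustration only. Probabilities are
    maximum-likelihood estimates over the token count n.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        observed = count / n
        # Probability of the pair if the two words were independent.
        expected = (unigrams[w1] / n) * (unigrams[w2] / n)
        # PMI: log ratio of observed to expected co-occurrence.
        pmi = math.log2(observed / expected)
        # t-score: difference of means over the standard error,
        # using the usual approximation variance ~= observed.
        t_score = (observed - expected) / math.sqrt(observed / n)
        scores[(w1, w2)] = (pmi, t_score)
    return scores


# Toy, made-up token sequence (romanized); not data from the thesis.
toy_tokens = ["kala", "bazaar", "kala", "bazaar", "lal", "kila"]
toy_scores = bigram_scores(toy_tokens)
```

On real data the two measures tend to rank candidates differently: PMI favours rare pairs, while the t-score favours frequent ones, which is one reason both are usually reported side by side.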

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
