Constituency Parser for Hindi Noun Sequences and Role of Bracketing in Translation of English Compound Nouns into Hindi

Author: Arpita Batra
Date: 2017-12-22
Report no: IIIT/TH/2017/93
Advisor: Soma Paul

Abstract

Complex noun sequences in Hindi are formed from sequences of nouns and genitives. The Hindi genitive marker is “kā”, with the allomorphic variants “ke” and “kī”. When two or more nouns occur without any intervening postposition, the sequence is known as a compound noun. Some examples of complex noun sequences are: (1) “jilā cunāva adhikārī” (district election officer), (2) “tila kī miṭhāī kī dukāna” (shop of sweets made with sesame) and (3) “upabhoktā adālata ke vakīla” (consumer court’s lawyer).

The rightmost noun is the head of the whole construction, but the inner structure of the sequence can be quite complex: (a) nouns within the sequence can modify the rightmost head, or (b) a local head can modify another local head or the head of the whole sequence. For example, in (1), “adhikārī” is the head and both “jilā” and “cunāva” modify it, giving the structure (jilā (cunāva adhikārī)). In (2), by contrast, “tila” modifies “miṭhāī”, and “miṭhāī” in turn modifies “dukāna”, so the structure is ((tila kī miṭhāī) kī dukāna). The more nouns a sequence contains, the more complex its structure. In the Hindi Treebank data, 85.37%, 12.54% and 1.80% of the sequences have three, four and five nouns respectively.

In this thesis, we attempt to bracket the local sub-structures of a complex noun sequence, a task termed constituency parsing. Constituency parsing recursively builds the inner structure of the complex noun sequence. It is a significant NLP task because the interpretation of a sequence depends on the correct identification of its inner structure. We explore both syntactic and statistical methods for predicting the bracketing of complex noun sequences. In Hindi, the genitive marker agrees with the head of the sub-sequence modified by it; our syntactic approach uses this clue.
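
The bracketings of examples (1) and (2) can be sketched as nested pairs. This is only an illustrative representation, not the data structure used in the thesis: the names `Tree`, `head` and `render` are hypothetical, the genitive markers are left out for brevity, and the transliterations are written with standard macrons.

```python
from typing import Tuple, Union

# A bracketed noun sequence is either a single noun (a string)
# or a pair (modifier, head), each of which may itself be bracketed.
Tree = Union[str, Tuple["Tree", "Tree"]]


def head(tree: Tree) -> str:
    """Return the rightmost noun, which heads the whole construction."""
    while not isinstance(tree, str):
        tree = tree[1]  # descend into the right-hand (head) side
    return tree


def render(tree: Tree) -> str:
    """Render the tree using the parenthesis notation of the abstract."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return f"({render(left)} {render(right)})"


# Example (1): right-branching structure.
right_branching: Tree = ("jilā", ("cunāva", "adhikārī"))
# Example (2), genitives omitted: left-branching structure.
left_branching: Tree = (("tila", "miṭhāī"), "dukāna")

print(render(right_branching), "head:", head(right_branching))
print(render(left_branching), "head:", head(left_branching))
```

In both shapes the head is found by always following the right-hand branch, which mirrors the statement that the rightmost noun heads the construction.
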
In the statistical approach, we mainly exploit the affinity between a head and its modifier, based on how frequently they occur together in the corpus. The method is augmented with semantic class information for the head and modifier nouns from Hindi WordNet. Finally, we combine the two methods in a hybrid approach for bracketing complex noun sequences, which achieves 85.85% accuracy.

We also show that identifying the inner structure of a complex noun sequence helps in determining its translation. For this experiment, we take three-word English noun compounds and translate them into Hindi. The translation strategy is motivated by our observation of English-Hindi parallel corpora, where we observe (as others have also reported) that English licenses multiword noun compounds more frequently than Hindi does. Hindi prefers syntactic phrases in which a genitive postposition is inserted between the head and the modifier. For compounds with three nouns, we propose that the inner bracketing helps determine where the genitive marker is inserted. For the left-bracketed structure ((e1 e2) e3), we predict that the genitive will be inserted before e3; for the right-bracketed structure (e1 (e2 e3)), it will be inserted after e1. Testing this hypothesis, we obtain accuracies of 68.86%, 71.99% and 59.12% on the BNC, ukWaC and ILCI corpora respectively.
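
The two ideas above, affinity-based bracketing and bracketing-driven genitive placement, can be sketched for a three-noun compound as follows. This is a minimal illustration under stated assumptions, not the method of the thesis: the affinity score here is a plain comparison of pair counts, ties default to left bracketing, the counts are invented, and `GEN` is a placeholder for whichever form of the genitive marker the Hindi phrase would require.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical co-occurrence counts for adjacent noun pairs.
PairCounts = Dict[Tuple[str, str], int]


def bracket(e1: str, e2: str, e3: str, counts: PairCounts) -> str:
    """Pick a bracketing by comparing how often the adjacent pairs co-occur.

    Assumption: a simple adjacency comparison stands in for the affinity
    factor; the actual scoring in the thesis is not specified here.
    """
    left_affinity = counts.get((e1, e2), 0)   # supports ((e1 e2) e3)
    right_affinity = counts.get((e2, e3), 0)  # supports (e1 (e2 e3))
    return "left" if left_affinity >= right_affinity else "right"


def insert_genitive(nouns: Sequence[str], bracketing: str, marker: str = "GEN") -> str:
    """Place the genitive placeholder according to the bracketing."""
    e1, e2, e3 = nouns
    if bracketing == "left":    # ((e1 e2) e3): genitive before e3
        return f"{e1} {e2} {marker} {e3}"
    if bracketing == "right":   # (e1 (e2 e3)): genitive after e1
        return f"{e1} {marker} {e2} {e3}"
    raise ValueError(f"unknown bracketing: {bracketing!r}")


# Invented counts, for illustration only.
toy_counts: PairCounts = {("consumer", "court"): 40, ("court", "lawyer"): 15}
compound = ("consumer", "court", "lawyer")

shape = bracket(*compound, toy_counts)
print(shape, "->", insert_genitive(compound, shape))
```

With these toy counts the compound is left-bracketed and the placeholder falls before the third noun, the same position the genitive occupies in “upabhoktā adālata ke vakīla” (consumer court’s lawyer).
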

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
