Developing Semantic Role Labeler for Hindi and Urdu

Author: Maaz Nomani
Date: 2017-07-21
Report no: IIIT/TH/2017/63
Advisor: Dipti Misra Sharma

Abstract

A Proposition Bank, or PropBank, is a corpus in which the arguments of each verb predicate (simple or complex) are marked with their semantic roles [Kingsbury and Palmer, 2003]. A PropBank adds a layer of semantic representation to a treebank already annotated with phrase structure or dependency structure. Adding this semantic layer through predicate-argument structure is challenging, because the syntactic structures in which a predicate's arguments and adjuncts are realized can vary with the senses present in the sentence. We describe our effort to build a Proposition Bank for Urdu [Anwar et al., 2016], an Indian language spoken in major parts of India and Pakistan, and the process of annotating predicate-argument structure on top of the Urdu Dependency Treebank [Bhat and Sharma, 2012]. The need for such a resource arises because, while the Urdu Treebank does provide syntactico-semantic information, exploiting the complete semantic information present in the predicate-argument structure of sentences will help dependency parsing. Adding a semantic layer to the Dependency Treebank addresses this requirement. The Urdu Proposition Bank was annotated by two annotators. Inter-annotator agreement statistics show almost perfect agreement between the two annotators, which implies a shared understanding of the annotation guidelines and of the linguistic phenomena present in Urdu. We also introduce a statistical Semantic Role Labeler for Hindi and Urdu, two major Indian languages [Anwar and Sharma, 2016]. A Semantic Role Labeler automatically marks the arguments (valency) of a predicate in the predicate-argument structure of a sentence. The proposed system takes a supervised machine learning approach, trained on the Hindi and Urdu PropBanks that are being built for these languages [Vaidya et al., 2011] [Anwar et al., 2016].
Our approach is a two-stage architecture: the system first identifies the arguments of a predicate in a PropBank sentence, and then classifies the identified arguments into the semantic labels of the PropBank. Our system uses Logistic Regression and Support Vector Machines [Pedregosa et al., 2011] to identify and classify the arguments of a predicate, respectively, into one or more classes, which are the semantic labels used to build the PropBanks. We show that our system reaches a precision of around 75% when identifying the arguments of Urdu predicates and 86% when doing so for Hindi. When classifying the identified arguments into semantic roles, it reaches a precision of around 83% for the Urdu PropBank and 58% for the Hindi PropBank. We also experimented with automatic dependency parses produced by dependency parsers, using them as features for the semantic role labeler, to observe how the labeler performs when exposed to automatic rather than gold parses. As expected, the accuracy of the labeler dropped when automatic parses were used in place of gold parses. To the best of our knowledge, this is the first attempt to build a semantic role labeler for any Indian language. Past research on dependency parsing of Indian languages has shown that the semantic information present in sentences provides important cues for parsing [Bharati et al., 2008] [Ambati et al., 2009] [Jain et al., 2013] [Bhat and Sharma, 2013]. Once the semantic role labeler for Hindi and Urdu was available, we wanted to study the impact of its semantic labels on parsing. We therefore built an automatic pipeline combining a dependency parser and the semantic role labeler.
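The two-stage architecture described above can be illustrated with a minimal sketch in scikit-learn, the library cited as [Pedregosa et al., 2011]. This is not the thesis implementation: the feature names (`dep`, `chunk`, `case`), the toy training examples, and the helper names `train_two_stage` and `label_candidates` are assumptions made purely for illustration. Stage 1 uses Logistic Regression to decide whether a candidate chunk is an argument; stage 2 uses a Support Vector Machine to assign a PropBank-style label to the chunks that stage 1 keeps.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data: one feature dict per candidate chunk of a predicate.
# A label of None means "not an argument of the predicate".
TRAIN = [
    ({"dep": "k1", "chunk": "NP", "case": "ne"}, "ARG0"),
    ({"dep": "k1", "chunk": "NP", "case": "0"}, "ARG0"),
    ({"dep": "k2", "chunk": "NP", "case": "ko"}, "ARG1"),
    ({"dep": "k2", "chunk": "NP", "case": "0"}, "ARG1"),
    ({"dep": "k7p", "chunk": "NP", "case": "meM"}, "ARGM-LOC"),
    ({"dep": "lwg", "chunk": "VGF", "case": "0"}, None),
    ({"dep": "rsym", "chunk": "BLK", "case": "0"}, None),
    ({"dep": "ccof", "chunk": "CCP", "case": "0"}, None),
]


def train_two_stage(examples):
    """Fit an argument identifier (stage 1) and a role classifier (stage 2)."""
    features = [f for f, _ in examples]
    is_argument = [label is not None for _, label in examples]

    # Stage 1: binary argument identification with Logistic Regression.
    identifier = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    identifier.fit(features, is_argument)

    # Stage 2: role classification with an SVM, trained on true arguments only.
    arguments = [(f, label) for f, label in examples if label is not None]
    classifier = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
    classifier.fit([f for f, _ in arguments], [label for _, label in arguments])
    return identifier, classifier


def label_candidates(identifier, classifier, candidates):
    """Return one role (or None) per candidate chunk."""
    keep = identifier.predict(candidates)
    roles = [None] * len(candidates)
    selected = [i for i, flag in enumerate(keep) if flag]
    if selected:
        predicted = classifier.predict([candidates[i] for i in selected])
        for i, role in zip(selected, predicted):
            roles[i] = str(role)
    return roles


identifier, classifier = train_two_stage(TRAIN)
CANDIDATES = [
    {"dep": "k1", "chunk": "NP", "case": "ne"},
    {"dep": "lwg", "chunk": "VGF", "case": "0"},
]
print(label_candidates(identifier, classifier, CANDIDATES))
```

Because stage 2 only sees chunks accepted by stage 1, identification errors propagate into classification, which is one reason the two stages are evaluated separately.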
The semantic role labeler, using automatic dependency labels, produces semantic labels that are subsequently used as features for the dependency parser. The intuition behind this pipeline is that we wanted to examine the accuracy and performance of the parser when exposed to automatic semantic labels. We also worked on inter-chunk and intra-chunk parsing, and performed several experiments to analyze the role of postpositions (case markers) in parsing. One such experiment split the available data set into 10 folds and analyzed the parser's results on each fold to get a clearer picture of its performance.
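The 10-fold experiment mentioned above can be sketched as follows. The abstract does not say how the folds were formed, so the round-robin assignment and the name `k_fold_splits` are assumptions for illustration only; the integers stand in for treebank sentences.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs, holding out each of the k folds once."""
    # Round-robin assignment keeps fold sizes within one item of each other.
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test


# Placeholder data: integers standing in for 25 treebank sentences.
sentences = list(range(25))
for fold_index, (train, test) in enumerate(k_fold_splits(sentences)):
    # A parser would be trained on `train` and scored on `test` here.
    print(fold_index, len(train), len(test))
```

Reporting the parser's score per fold, rather than only the average, shows how much the results vary with the choice of held-out data.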

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
