Developing a Pilot Hindi Treebank Based on Computational Paninian Grammar

Author: Rafiya Begum
Date: 2017-11-06
Report no: IIIT/TH/2017/90
Advisor: Dipti Misra Sharma

Abstract

The Penn Treebank has demonstrated the importance of treebanks as a linguistic resource for NLP. This research presents an effort to develop a pilot treebank for Hindi that can serve as the basis for creating a large-scale Hindi treebank. Building a treebank requires a computational grammar framework, an annotation scheme based on the chosen grammar, guidelines for annotating the various types of constructions in the language concerned, and related resources such as verb frames. Since Hindi has relatively free word order, a dependency grammar formalism is well suited to it. We therefore chose the Computational Paninian Grammar framework [36]; Panini’s grammar is a dependency grammar [99, 162]. The scheme for annotating treebanks for Indian languages was accordingly developed on the basis of this framework. As part of this study, a pilot treebank for Hindi (HyDT, the Hyderabad Dependency Treebank for Hindi) [21] was developed and released for ICON-2009 (International Conference on Natural Language Processing 2009) [86]. The scheme [21] and the guidelines for Hindi treebank annotation developed during this study were later modified and are being used for a multi-layered, multi-representational treebank for Hindi and Urdu [39, 42, 188], a collaborative project among several universities. As part of this study, I annotated about 2230 sentences of Hindi from the CIIL (Central Institute of Indian Languages, Mysore) corpus. Part of the data was annotated by another annotator, whose annotation I later validated. Various issues encountered during annotation were studied and resolved to strengthen the scheme. I discuss some of the Hindi constructions that were analyzed intensively, such as causatives, relative clauses, conjunct verbs, perception verbs, and conditionals. We also looked in detail at how to treat ellipsis in a dependency treebank (HyDT).
I then present detailed studies of causative verbs [20] and conjunct verbs [19], showing the classification of Hindi causative verbs and the diagnostics used to identify conjunct verbs, respectively. The study of these constructions helped us update the existing guidelines for further annotation. Along with the Hindi treebank (HyDT), I also created a supplementary resource of verb frames for 687 Hindi verbs. I present this work on verb frames [22], including the methodology used to prepare the frames and the criteria followed for classifying Hindi verbs. The verb frames were developed following the Paninian grammatical framework. A verb frame lists the arguments that a particular verb can take in a particular sense, i.e., it shows the mandatory dependency relations for that verb. The main goal of this work is to create a linguistic resource that will prove indispensable for various NLP applications. The frames help in preparing demand charts for a Hindi parser [84, 25] and help annotators decide the dependency relations for a given verb in the corpus. I have also worked on the mapping between PropBank annotation and dependency annotation based on the Paninian grammatical framework [21, 36]. Finally, I discuss the use of HyDT data [21] in various experiments.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
