IIIT Hyderabad Publications
Advancements in Dependency Parsing for Indian Languages

Author: Juhi Tandon
Date: 2018-07-09
Report no: IIIT/TH/2018/36
Advisor: Dipti Misra Sharma

Abstract

Indian languages are morphologically rich and have free word order. They also possess many other distinguishing characteristics. With these in mind, the Computational Paninian Grammar (CPG) formalism was chosen for these languages to represent the syntactico-semantic relations between the words in a sentence and to establish their relationship with the verb. The syntactico-semantic dependency relations and their labels defined in the CPG formalism are very fine-grained, to account for the rich grammatical functions. The scheme has 82 distinct dependency labels (interchunk and intrachunk combined). It has been observed that more semantically oriented annotation schemes make labeled parsing more difficult than schemes based on more surface-oriented grammatical functions. These relations can be organised into a hierarchical structure in which each level represents a degree of granularity that can be underspecified. We aim to explore whether reducing the granularity of these labels can help bridge the gap between Labeled Attachment Score and Unlabeled Attachment Score. Universal Dependencies (UD), on the other hand, has a coarser scheme of grammatical representation. We extend UD to Indian languages by converting Paninian Dependencies to UD for the Hindi Dependency Treebank (HDTB) and observe the effects of a relatively sparse taxonomy. We discuss the differences in annotation between the two schemes, present parsing experiments for both formalisms, and empirically evaluate their strengths and weaknesses for Hindi. We produce an automatically converted Hindi treebank conforming to the international standard UD scheme, making it useful as a resource for multilingual language technology.
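The gap between labeled and unlabeled attachment scores, and the effect of collapsing fine-grained labels, can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names, the toy sentence, and the coarsening map (merging the place and time subtypes k7p and k7t into k7) are assumptions made for illustration only.

```python
from typing import Callable, List, Tuple

# One arc per token: (index of the head token, dependency label).
Arc = Tuple[int, str]


def attachment_scores(
    gold: List[Arc],
    pred: List[Arc],
    coarsen: Callable[[str], str] = lambda label: label,
) -> Tuple[float, float]:
    """Return (unlabeled, labeled) attachment scores for one sentence.

    A token counts towards the unlabeled score when its head is correct,
    and towards the labeled score when head and (coarsened) label both match.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    both_ok = sum(
        g[0] == p[0] and coarsen(g[1]) == coarsen(p[1])
        for g, p in zip(gold, pred)
    )
    return head_ok / len(gold), both_ok / len(gold)


# Hypothetical one-level coarsening of a label hierarchy (illustrative only).
COARSE_MAP = {"k7p": "k7", "k7t": "k7"}


def coarsen_label(label: str) -> str:
    """Map a fine-grained label to its parent; unknown labels are unchanged."""
    return COARSE_MAP.get(label, label)


# Toy four-token sentence; heads and labels are made up for the example.
gold = [(2, "k1"), (0, "root"), (2, "k7p"), (2, "k2")]
pred = [(2, "k1"), (0, "root"), (2, "k7t"), (1, "k2")]

uas, las_fine = attachment_scores(gold, pred)
_, las_coarse = attachment_scores(gold, pred, coarsen_label)
print(uas, las_fine, las_coarse)
```

In this toy case the third token has the right head but the wrong fine-grained label, so it is penalised only in the fine-grained labeled score; after coarsening, the labeled score rises to match the unlabeled one.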
Indian language treebanking was undertaken to create resources for data-driven syntactic analysis. The annotations in Indian language treebanks are generally multi-layered and provide information on the part-of-speech category of word forms, their morphological features, related word groups and the syntactic relations. Rich syntactic features are very important for building state-of-the-art syntactic analysers, but they require a lot of feature engineering expertise. These indicative features suffer from data sparsity, incompleteness and expensive extraction, and building tools to generate them automatically is itself costly. With this in mind, our work proposes applying a non-linear neural network to parse five resource-poor Indian languages belonging to two major language families, Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. The non-linear architecture addresses all the problems mentioned above. While a little work has previously been done on linear transition-based parsing for Bengali and Telugu, we present one of the first parsers for Marathi, Kannada and Malayalam. All these languages have free word order and range from moderately to very rich in morphology. Therefore, in this work we propose using linguistically motivated morphological features (suffix and postposition) in the non-linear framework to capture the intricacies of both language families. We also capture chunk information and gender, number and person information in this model. We put forward ways to represent these features cost-effectively using monolingual distributed embeddings. Instead of relying on expensive morphological analyzers to extract the information, these embeddings are used to increase parsing accuracy for resource-poor languages.
Our experiments compare the two language families on the importance of different morphological features. Part-of-speech taggers and chunkers for all the languages are also built in the process.

Full thesis: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.