Cost Effective Dependency Parsing for Indian Languages

Author: Aniruddha Tammewar
Date: 2016-06-15
Report no: IIIT/TH/2016/21
Advisor: Dipti Misra Sharma

Abstract

Indian languages are morphologically rich with relatively free word order (MoR-FWO) and hence differ from English in structure and morphology. They have many distinguishing characteristics, which we must keep in mind and plan strategies around when working with these languages. We worked on improving dependency parsing for Indian languages, specifically Hindi, an Indo-Aryan language. Conventional dependency parsing work has focused on developing robust data-driven parsing techniques. This prompted efforts to create large hand-annotated treebanks, which serve as training input for data-driven parsers. The annotations in Indian language treebanks are generally multi-layered and provide information on the part-of-speech category of word forms, their morphological features, related word groups, and syntactic relations. To improve accuracy, ever richer features are being added. This manual annotation is expensive, as it requires a great deal of human effort, and creating treebanks for all languages is a tedious task. Even when treebanks are available, a real-time scenario requires many tools to extract features automatically, and building such tools is itself a complex task. We are in an era of almost unlimited access to raw data, yet we often struggle to make sense of most of it. Much of this data is unlabeled and thus useless in many traditional supervised machine learning scenarios, which require explicitly labeled, hand-annotated examples. In this work, we present our efforts towards exploring cost-effective approaches for building and improving parsers for resource-poor languages. For this purpose, we try to use unsupervised techniques to extract features from the widely available monolingual raw corpus.
Using cross-lingual treebank transfer, we exploit the treebanks available for other languages and, with techniques such as machine translation (MT), try to generate a treebank for the target language. This treebank can then be used to train a parser. We first try this approach for Hindi. An important constraint of this approach is that treebank annotation needs to be similar across languages. For this, we use the Universal Dependencies (UD) framework. Universal Dependencies is an initiative to create cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. Previous studies have shown that the Computational Paninian Grammar (CPG) framework is better suited for Indian languages, so we compare other techniques with cross-lingual parsing in the UD framework. In the concluding work, we make use of vector space modeling on large monolingual raw data, a recent technique widely used across different tasks. Using a large monolingual corpus helps reduce the problem of data sparsity. We explore this technique to achieve three goals. The first goal is to improve the state-of-the-art accuracy for Hindi by using word embeddings extracted from vector space modeling as additional features alongside the conventional features. The second goal is to help build parsers for less-resourced languages. This is done by replacing the costly linguistic features with word embeddings, which requires minimal human annotation. The third goal is to improve the parser's performance on general-domain data. We show results where the parser is trained on the News domain and the input sentences come from four different domains: Box-Office, Cricket, Gadget, and Recipe.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
