IIIT Hyderabad Publications
Exploiting Linguistic Knowledge to Address Representation and Sparsity Issues in Dependency Parsing of Indian Languages

Author: Riyaz Ahmad Bhat
Date: 2017-03-15
Report no: IIIT/TH/2017/10
Advisor: Dipti Misra Sharma

Abstract

Recent trends in natural language processing (NLP) show the ever-increasing popularity of dependency-based analysis of natural language texts. Dependency representations offer simplicity, compactness and a transparent encoding of predicate-argument structure. In the last decade and a half, a number of dependency treebanks have been built and various parsing algorithms have been proposed for automatic dependency analysis across a wide range of languages. Over the years, it has been observed that morphologically rich, free word order languages, unlike fixed word order languages, are harder to parse, regardless of the parsing technique used. On the one hand, rich morphology provides explicit cues for parsing; on the other hand, it worsens the problem of data sparsity, as it leads to high lexical diversity and variation in word order. In this thesis, we aim to address this trade-off for accurate and robust parsing of morphologically rich Indian languages. We present novel strategies to effectively represent morphology in parsing models and to mitigate the effects of its trade-offs. We propose to represent morphosyntactic information as higher-order features under the Markovian assumption. More specifically, we use the history of a transition-based parser to extract and propagate morphological information, such as case and grammatical agreement, as higher-order features for parsing nominal nodes. Despite its benefits, rich morphology can also pose a multitude of challenges to statistical parsing. The most prominent issue is a sampling bias towards the canonical structures of a language: since current parsers are mostly trained on formal texts, even a slight deviation from canonical word order can severely affect their performance.
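The idea of propagating case and agreement from the parser's history as higher-order features can be illustrated with a minimal, hypothetical sketch. The `Token`, `Config`, and feature names below are illustrative assumptions, not the thesis implementation: the point is that only the stored history of earlier attachments (the Markovian state) is consulted when scoring the current configuration.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str
    case: str = "_"       # e.g. the Hindi ergative marker 'ne'
    gender: str = "_"
    number: str = "_"

@dataclass
class Config:
    stack: list = field(default_factory=list)
    buffer: list = field(default_factory=list)
    # history: maps a head token (by id) to the dependents attached so far
    history: dict = field(default_factory=dict)

def higher_order_features(cfg: Config) -> list:
    """Extract case/agreement of previously attached dependents of the
    stack-top token as higher-order features. Only the recorded history
    is used, so the feature extraction stays Markovian."""
    feats = []
    if cfg.stack:
        s0 = cfg.stack[-1]
        for dep in cfg.history.get(id(s0), []):
            feats.append(f"s0_dep_case={dep.case}")
            feats.append(f"s0_dep_agr={dep.gender}|{dep.number}")
    return feats

# Illustrative usage: a verb with one case-marked nominal dependent
verb = Token("khaaya", gender="m", number="sg")
noun = Token("ladke", case="ne", gender="m", number="sg")
cfg = Config(stack=[verb], history={id(verb): [noun]})
print(higher_order_features(cfg))  # case and agreement of the attached noun
```

In a real transition-based parser these strings would join the usual configuration features fed to the classifier, letting it condition attachment decisions on the case and agreement of nominals attached earlier in the derivation.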
To overcome this bias, we propose a sampling technique to generate training instances with diverse word orders from the available canonical structures. We show that linearly interpolated models trained on diverse views of the same data can effectively parse both canonical and non-canonical texts. Similarly, to mitigate the effect of lexical sparsity, we use supervised domain adaptation techniques to train parsers on lexically more diverse annotations from augmented Hindi and Urdu treebanks. We demonstrate that a feedforward neural network-based dependency parser trained on the augmented, harmonized Hindi and Urdu data performs significantly better than parsing models trained separately on the individual datasets. Furthermore, we explore lexical semantics as a viable alternative to more training data for parsing semantically rich but sparse dependency annotations in Indian language treebanks. We show that lexical semantics in the form of discrete and continuous features, such as ontological categories, Brown clusters and word embeddings, can play a major role in disambiguating highly rich dependency relations.

Full thesis: pdf

Centre for Language Technologies Research Centre
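The word-order sampling idea described in the abstract can be sketched as follows. This is a hypothetical illustration, not the thesis algorithm: it derives a non-canonical training instance from a canonical one by reordering the dependents of each head in a projective dependency tree, so every sampled order still realizes the same tree.

```python
import random

def linearize(head, children, order):
    """Recursively linearize a subtree, visiting the head and its
    dependents in the (possibly shuffled) order produced by `order`."""
    out = []
    for node in order(children.get(head, []) + [head]):
        if node == head:
            out.append(head)
        else:
            out.extend(linearize(node, children, order))
    return out

def permute_tree(tokens, heads, rng):
    """tokens: list of word forms; heads: head index per token (-1 = root).
    Returns one randomly reordered surface string for the same tree."""
    children = {}
    for i, h in enumerate(heads):
        children.setdefault(h, []).append(i)
    root = children[-1][0]

    def order(nodes):
        nodes = list(nodes)
        rng.shuffle(nodes)  # uniform policy; a real sampler could be biased
        return nodes

    return [tokens[i] for i in linearize(root, children, order)]

# Illustrative usage on a toy Hindi-like sentence (khaaya is the root verb)
rng = random.Random(0)
print(permute_tree(["raam", "ne", "seb", "khaaya"], [3, 0, 3, -1], rng))
```

Training instances sampled this way keep the gold dependency structure intact while exposing the parser to non-canonical word orders, which is what allows the interpolated models to handle both kinds of text.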
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.