Error Detection and Dependency Parsing

Author: bhasha.agrawal@research.iiit.ac.in
Date: 2016-08-03
Report no: IIIT/TH/2016/44
Advisor: Dipti Misra Sharma

Abstract

Syntactic parsing, a major component of natural language processing, involves analysing the structure of a sentence according to a given set of rules. Despite extensive research in this field, syntactic parsing for Indian languages is still not on par with parsing for languages such as English. One reason is that most Indian languages are morphologically rich, free word-order (Mor-FWO) languages, and parsing Mor-FWO languages is a challenging task. In addition, machine learning approaches to parsing require a sizable annotated resource called a treebank: a text corpus in which each sentence carries a manually annotated grammatical analysis. The unavailability of such resources is a major reason why good parsers do not exist for these languages. In this work, we aid the semi-automatic development of good-quality treebanks for Indian languages by proposing a method to automatically detect potential errors in a treebank. With this approach, manual validators can validate treebanks with less time and effort.

Since the availability of the Penn Treebank [44], treebanks have played a crucial role in building automatic natural language processing tools for various languages. In particular, treebanks have helped in building robust and efficient syntactic parsers. The availability of syntactic parsers is critical for further processing of a sentence, e.g. semantic analysis [31]. We therefore explore methods to develop treebanks for resource-poor languages, for which large amounts of unannotated data may be available but very little labeled data exists. Because annotation is an expensive task, a tool that helps in building quality treebanks would considerably benefit the development of statistical parsers.

Statistical parsers are generally trained on data from a single domain, and evaluation is generally done on that same domain. When deployed in natural language applications, however, they can receive input from any domain. In such cases, parsers behave poorly and their performance degrades drastically from the evaluated figures. This might be because a parser trained on data from one domain learns domain-specific features that are not valid for other domains. So, while working on these two aspects of grammatical analysis of sentences in Indian languages, I also explored whether existing parsers can be made robust to new domains.

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
