IIIT Hyderabad Publications

Advancements to Hindi Dependency Parsing: Semantic Information, Ensembling and PQE

Author: Naman Jain
Date: 2016-05-17
Report no: IIIT/TH/2016/18
Advisor: Dipti Misra Sharma

Abstract

Natural Language Processing (NLP) is a challenging field within artificial intelligence and computational linguistics. In simple terms, it can be defined as the processing of natural language in any form, whether speech or written text. Although extensive research has been done for many of the world's languages, Indian languages still lag behind. Processing any natural language requires analysis at multiple levels: word, phrase, sentence, and semantic levels, as well as the higher levels of pragmatics and discourse. In this work we present our efforts to advance analysis at the sentence level, which in linguistic terms is known as syntactic parsing. Syntactic parsing involves establishing relations between the words of a sentence in order to convey its possible meaning. Indian languages are morphologically rich and exhibit free word order (MoR-FWO). Dependency parsing, a type of syntactic parsing, is better suited to such languages.

Our efforts begin with delivering a state-of-the-art Hindi dependency parsing system through the platform provided by a shared task (Sharma et al., 2012). We employed a data-driven, transition-based statistical system (MaltParser), trained on the Hindi Dependency Treebank (Bhatt et al., 2009; Palmer et al., 2009). The error analysis performed in that task helped us target the problems more specifically. In the next phase, to address problems such as case ambiguity, data sparsity, and missing case markers, we aided dependency parsing by enriching the training model with semantic information. This information is extracted automatically from a rich lexical resource, Hindi WordNet (Narayan et al., 2002).

Learning from the insights obtained in this process, we moved to another well-established approach, ensembling. Ensembling exploits the diversity of multiple parsing systems and combines their strengths to improve parsing performance. We explored two ensembling approaches, re-parsing algorithms and word-by-word voting, using six different weighting strategies to combine six algorithmic variants of MaltParser. Improvements were observed with the second approach. A systematic comparison of the two ensembling techniques led us to search for a better weighting strategy. The search ended with the Parse Quality Estimation (PQE) score. Adapting work done in the past, we extended this functionality for the purpose of ensembling. The approach, which had failed earlier, now showed improvements. Further, we expanded the scope of the PQE score to dependency arcs (attachment) to capture confusion made by the oracle in the parsing system. The functionality was also extended to joint prediction of arcs and labels simultaneously. To demonstrate the efficacy of the approach, we implemented several real-world applications of the PQE score.

Finally, we proposed a robust evaluation framework in terms of Domain Adaptability (DA) and Inter-Language Portability (ILP) to better judge the effectiveness of the Hindi dependency parser. During this evaluation, using the property of portability, we also built dependency parsers for two Dravidian languages, Tamil and Telugu, which can be integrated easily into real-world NLP systems such as machine translation.

Full thesis: pdf
Centre: Language Technologies Research Centre
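
As a rough illustration of the word-by-word voting idea mentioned in the abstract, the sketch below combines the outputs of several parsers by letting each parser cast a weighted vote for the (head, label) pair of every token. It is a minimal sketch under stated assumptions, not the thesis implementation: the function name `vote_word_by_word`, the data layout, the tie-breaking rule, and the example weights are all hypothetical, and the weights merely stand in for whichever weighting strategy (for example a per-parser accuracy or a PQE-style score) is used.

```python
from collections import defaultdict


def vote_word_by_word(predictions, weights):
    """Combine parser outputs token by token with weighted voting.

    predictions: one list per parser; each list holds a (head, label)
                 tuple for every token of the same sentence.
    weights:     one non-negative weight per parser (hypothetical stand-in
                 for any weighting strategy).
    Returns the winning (head, label) tuple for each token.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per parser")
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all parsers must analyse the same tokens")

    combined = []
    for i in range(n_tokens):
        scores = defaultdict(float)
        for parser_output, weight in zip(predictions, weights):
            scores[parser_output[i]] += weight
        # Ties go to the candidate proposed by the earliest parser,
        # because dicts keep insertion order and max() returns the
        # first maximal element.
        combined.append(max(scores, key=scores.get))
    return combined


# Illustrative three-token sentence; heads are 1-based token indices,
# 0 marks the root, and the labels are placeholders.
parser_a = [(2, "k1"), (0, "root"), (2, "k2")]
parser_b = [(2, "k1"), (0, "root"), (1, "k2")]
parser_c = [(3, "k2"), (0, "root"), (1, "k2")]

combined = vote_word_by_word([parser_a, parser_b, parser_c], [1.0, 1.0, 1.0])
```

Because each token is decided independently, the combined analysis is not guaranteed to form a well-formed tree; re-parsing approaches address this by searching for the best tree over the weighted arcs instead.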