IIIT Hyderabad Publications
Context Based Morphological Analysis
Author: Deepak Malladi
Report no: IIIT/TH/2016/4
Advisor: Dipti Misra Sharma
Abstract: Morphological analysis is a fundamental task in most Natural Language Processing (NLP) applications, especially for morphologically rich languages such as Hindi. Morphological analysis for Hindi involves predicting lemma, POS, gender, number, person, case-marker, TAM (tense, aspect and modality) and vibhakti. In general, morphological analyzers predict all possible analyses for a given word. Most NLP tasks, however, need a single analysis per word, so the multiple analyses must be disambiguated to arrive at the one that best fits the given sentential context. The prime motivation for the research in this thesis comes from our initial set of Hindi parsing experiments. The existing morphological analyzers for Hindi do not predict context-based morphological analysis, and because of this lack of automatic context-based morphological information, parsing accuracy could not be improved. In this thesis, we try to predict a single analysis for a word in a given context. The thesis deals with predicting context-based morph information for the attributes lemma, gender, number, person, case-marker, TAM and vibhakti. The existing analyzers also perform poorly on Out-Of-Vocabulary (OOV) words. We aim to address this issue as well and make an exhaustive evaluation of our predictions for OOV words. For lemma prediction, we adopt a machine translation approach. We treat gender, number, person and case-marker prediction as a classification problem. TAM and vibhakti are better predicted by a rule-based approach. For lemma, gender, number, person and case prediction, we achieved an overall accuracy of 84.25% and an OOV accuracy of 63.06%. To show that the predicted morphological information helps in NLP applications, we carry out parsing experiments without and with the predicted morph information and report Labeled Attachment Scores of 87.75% and 89.41% respectively.
Building machine translation models is time-consuming. Hence, in the later part of the thesis we also treat lemma prediction as a classification problem. We also experiment with other features for gender, number, person and case prediction. We report overall and OOV accuracies of 85.87% and 65.96% respectively, a 1-2% improvement over our earlier set of experimental results. We extend our approach to two other Indian languages, Urdu and Telugu, for predicting context-based morphological information.
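As a rough illustration of what treating morphological prediction as a classification problem can look like, the sketch below builds context-window features for a single token. The function name `context_features`, the feature set (word-final character n-grams and neighbouring words) and the transliterated toy sentence are assumptions made for illustration; they are not the features or data used in the thesis.

```python
def context_features(tokens, i, window=1, max_suffix=3):
    """Build a feature dict for tokens[i] from the word and its neighbours.

    Hypothetical feature set for illustration only; not taken from the thesis.
    """
    word = tokens[i]
    feats = {"word": word}
    # Word-final character n-grams: still available for out-of-vocabulary words.
    for n in range(1, min(max_suffix, len(word)) + 1):
        feats[f"suffix{n}"] = word[-n:]
    # Neighbouring words supply the sentential context.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word{offset:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


# Transliterated toy sentence (assumed example): "the boy read the book".
sentence = ["ladake", "ne", "kitaab", "padhii"]
features = context_features(sentence, 0)
```

In this toy sentence the form "ladake" is ambiguous out of context, and the following marker "ne" is the kind of contextual cue a classifier could use to choose one analysis. Feature dicts like this one would then be passed to any standard classifier, with one classifier per attribute (gender, number, person, case) being one possible design.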
Full thesis: pdf
Centre for Language Technologies Research Centre