IIIT Hyderabad Publications
A Shallow Parser for Malayalam
Report no: IIIT/TH/2016/49
Advisor: Dipti Misra Sharma
Like other Dravidian languages, Malayalam is agglutinative and morphologically rich. Computational processing of Dravidian languages is not trivial, because two or more words can join into a single string with a morpho-phonemic change at the point of joining. This process, known as “Sandhi”, complicates the identification of individual words. The current work attempts to break the barrier of word segmentation and to create a shallow parser for Malayalam that identifies non-recursive phrases in a given input. The shallow parser has three main modules: a Sandhi Splitter, a POS Tagger and a Chunker. Since words are the basic components of a sentence, failing to identify the individual words degrades the output of the shallow parser. To tackle the problem of “Sandhi”, after attempting a few rule-based approaches, we arrived at a hybrid “Sandhi Splitter” with an overall accuracy of 87%. This system uses a Naive Bayes classifier to identify the split point and hand-crafted character-level rules to induce the morpho-phonemic changes. A CRF-based part-of-speech tagger identifies the grammatical category of each word. Experiments with different feature templates showed that Malayalam depends more on word-internal features, such as prefix and suffix information, than on word-external features, such as the position of a word in a sentence. The highest overall tagging accuracy we obtained is 91.25%. The final module, the chunker, finds non-recursive phrases based on the part-of-speech tags of the words. This CRF-based chunker gave an overall accuracy of 94.33%. Error propagation is a problem in shallow parsing: errors made by each module affect the subsequent modules.
When the modules are arranged in a pipeline, so that the output of each module is the input of the next, the overall accuracy of the shallow parser dropped from 92% to 71%. This shows the need for a highly accurate sandhi splitter, which is the main source of error. The experiments clearly show that morphology is the key factor in processing Malayalam.
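To make the effect of error propagation concrete, the following minimal sketch multiplies the per-module accuracies quoted in the abstract. It assumes that errors in the three modules are independent and that each module's accuracy is unchanged on upstream output; both are simplifying assumptions, and this is not the evaluation procedure used in the thesis. The names `MODULE_ACCURACY`, `compounded_accuracy` and `weakest_stage` are hypothetical and introduced only for illustration.

```python
from math import prod

# Per-module accuracies quoted in the abstract, each measured in isolation.
MODULE_ACCURACY = {
    "sandhi_splitter": 0.87,
    "pos_tagger": 0.9125,
    "chunker": 0.9433,
}


def compounded_accuracy(accuracies):
    """Estimate the chance that every stage is correct.

    Assumes stage errors are independent, which is a simplification.
    """
    return prod(accuracies)


def weakest_stage(accuracy_by_stage):
    """Return the name of the stage with the lowest accuracy."""
    return min(accuracy_by_stage, key=accuracy_by_stage.get)


estimate = compounded_accuracy(MODULE_ACCURACY.values())
print(f"Estimated pipeline accuracy: {estimate:.1%}")
print(f"Weakest stage: {weakest_stage(MODULE_ACCURACY)}")
```

Under these assumptions the estimate comes to roughly 75%, somewhat above the 71% measured for the actual pipeline. That gap would be consistent with downstream modules performing worse on imperfectly split input than on clean input, though the sketch itself cannot confirm this.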
Full thesis: pdf
Language Technologies Research Centre