IIIT Hyderabad Publications
Developing Language Technology Tools and Resources for Sindhi: A Resource-Poor Language

Author: Raveesh Motlani
Date: 2018-07-28
Report no: IIIT/TH/2018/58
Advisor: Dipti Misra Sharma, Manish Shrivastava

Abstract

Sindhi is an Indo-Aryan language spoken by about 53 million people in Pakistan and about 5.8 million people in India. It is also one of the 22 official languages of India. Although these figures show how widely Sindhi is spoken, it remains a computationally resource-poor language. Developing natural language applications for any language requires linguistic resources and computational tools for that language. In this work, we have developed some fundamental resources and tools to support natural language processing of Sindhi.

We have developed a raw corpus and a part-of-speech (POS) annotated corpus for Sindhi Devanagari, and subsequently created a Conditional Random Fields (CRF) based automatic POS tagger that yields an accuracy of 91.78%. We have also built a paradigm-based finite-state morphological analyser for Sindhi Perso-Arabic using Apertium's lttoolbox. This morphological analyser currently has about 3500 entries and a coverage of more than 81% on Sindhi Wikipedia, which consists of 341.5k tokens. We worked on Sindhi Perso-Arabic because the Sindhi Devanagari corpus was very small, and good coverage of the language's vocabulary required a very large corpus, which was available in Perso-Arabic.

To diminish the script barrier, we also worked on transliteration. We developed a rule-based transliteration system between Sindhi Devanagari and Sindhi Perso-Arabic which yields 91.33% accuracy. We have also conducted experiments to demonstrate how resources can be leveraged and shared across the two scripts through transliteration. These include generating more data for Sindhi Devanagari to bootstrap the POS tagger, and building a POS tagger for Sindhi Perso-Arabic.
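The coverage figure quoted above is commonly understood as the share of corpus tokens that receive at least one analysis. The following is a minimal sketch of that calculation, assuming this "naive coverage" definition; the function names, the toy lexicon, the placeholder tokens, and the tag strings are all hypothetical and are not taken from the thesis or from the actual analyser.

```python
def naive_coverage(tokens, analyse):
    """Return the fraction of tokens for which analyse() yields at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for tok in tokens if analyse(tok))
    return covered / len(tokens)


# Hypothetical toy lexicon standing in for a real morphological analyser.
# Keys are placeholder surface forms; values are made-up analysis strings.
toy_lexicon = {
    "w1": ["w1<n><sg>"],
    "w2": ["w2<n><pl>"],
}


def toy_analyse(token):
    """Look up a token; an empty list means the token is unknown."""
    return toy_lexicon.get(token, [])


# Four tokens, one of which ("w3") is not in the toy lexicon.
tokens = ["w1", "w2", "w3", "w1"]
coverage = naive_coverage(tokens, toy_analyse)
print(f"coverage: {coverage:.2%}")
```

In practice the lookup function would wrap the compiled analyser rather than a dictionary, and the token list would come from a tokenised corpus dump.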
Centre: Language Technologies Research Centre