Development and Enhancement of Tools and Resources for Urdu Text Processing

Author: Aamir Farhan 20161078
Date: 2023-04-17
Report no: IIIT/TH/2023/22
Advisor: Dipti Misra Sharma

Abstract

The Urdu writing system is derived from the Perso-Arabic writing systems and has therefore adopted orthographic and morphological characteristics similar to those of Perso-Arabic languages. The first task in most NLP applications is Word Segmentation, which involves identifying the boundaries of words in written text. Identifying these boundaries accurately is crucial because all downstream NLP tasks depend on them, which makes Word Segmentation fundamentally important. Urdu adopts a continuous writing style that has no explicit and clear marker for word boundaries. Furthermore, the inherent non-joining attributes of certain Urdu characters create spaces within a word when it is written in digital format. Thus, Urdu has not only space omission but also space insertion issues, which make the word segmentation task challenging. We have studied and categorized the various issues observed in the inconsistent usage of the space character in Urdu script, along with the orthographic and morphological reasons behind them. Another challenge in the computational processing of Urdu is the lack of benchmark resources and corpora for word boundary identification. Leveraging what we learned from the orthographic study of the Urdu writing system, we have built a manually annotated benchmark corpus for Urdu Word Segmentation, using white space as the word boundary and the Zero-Width Non-Joiner (ZWNJ) character as the sub-word boundary. A Conditional Random Field (CRF) sequence model was then trained to predict a label for each character in a sequence of Urdu characters. Our model achieved state-of-the-art results, with an F1 score of 0.98 for word boundary identification. Furthermore, we have applied our word segmentation model to study the sociological phenomenon of diglossia in Urdu.
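
To make the annotation convention concrete, the following minimal Python sketch shows one way text annotated with white space (word boundary) and ZWNJ (sub-word boundary) could be turned into character-level labels for a sequence model such as a CRF. The label names "W", "S" and "O" and the function name to_char_labels are assumptions made for illustration; the abstract does not specify the label scheme or the features used in the thesis.

```python
ZWNJ = "\u200c"  # Zero-Width Non-Joiner, the sub-word boundary marker


def to_char_labels(annotated):
    """Turn annotated text into (character, label) pairs.

    Hypothetical label scheme (not taken from the thesis):
      "W": a word boundary follows this character (space or end of text)
      "S": a sub-word boundary follows this character (ZWNJ)
      "O": no boundary follows this character
    Boundary markers themselves are dropped from the output.
    """
    pairs = []
    for ch in annotated:
        if ch.isspace():
            # A space marks the previous character as word-final.
            if pairs:
                pairs[-1] = (pairs[-1][0], "W")
        elif ch == ZWNJ:
            # A ZWNJ marks a sub-word boundary unless a word
            # boundary has already been recorded at this position.
            if pairs and pairs[-1][1] == "O":
                pairs[-1] = (pairs[-1][0], "S")
        else:
            pairs.append((ch, "O"))
    # The last character of the text always ends a word.
    if pairs:
        pairs[-1] = (pairs[-1][0], "W")
    return pairs


# Latin placeholders stand in for Urdu characters to keep the example readable.
sample = "ab" + ZWNJ + "cd ef"
labels = [label for _, label in to_char_labels(sample)]
print(labels)
```

Pairs of this kind would serve as training targets: the model sees the characters with all boundary markers removed and learns to predict the label at each position.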

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
