Improvements to Telugu Dependency Parsing

Author: Sneha Nallani
Date: 2021-04-24
Report no: IIIT/TH/2021/67
Advisor: Dipti Misra Sharma, Manish Shrivastava

Abstract

Dependency parsing is extremely useful in several downstream NLP tasks. However, dependency parsers are not available for several Indian languages, primarily because annotated treebanks are unavailable. Telugu is a resource-poor language, and the existing Telugu treebank is very small. In this work, we extend and improve the existing Telugu dependency treebank, which is annotated with Paninian dependencies and was released as part of the ICON 2009 shared task on Indian Language Parsing. The existing treebank consists of around 1600 sentences; we extend it by another 987 sentences and clean up the existing treebank. The final extended treebank consists of 2436 sentences. The original treebank is annotated only with inter-chunk dependencies. We automatically annotate the intra-chunk dependency labels for the extended treebank using a shift-reduce parser based on context-free grammar rules written for Telugu. Annotating the intra-chunk dependencies provides a complete parse tree for every sentence. We also convert the treebank from the AnnCorra POS schema to the latest BIS POS schema.

In the last decade, there has been an increased focus on multilingual NLP. The Universal Dependencies (UD) project was created to promote multilingual research on syntax and parsing and to facilitate cross-lingual learning experiments. In this work, we automatically convert the existing Telugu Paninian dependency treebank to UD.

We also try to build an easily usable and robust dependency parser for Telugu. Previous work on Telugu dependency parsing focused primarily on either rule-based approaches or data-driven statistical approaches using MaltParser. These approaches typically relied on intermediate tools such as a POS tagger, morph analyzer, and shallow parser, which are expensive to build. Therefore, in this work we propose to bypass the pipeline approach and develop an end-to-end dependency parser for Telugu based on recent developments in NLP such as contextual vector representations. We train a BERT model on Telugu Wikipedia data and use vector representations from this model to train the parser. Each sentence token is associated with a vector representing the token in the context of that sentence, and the feature vectors are constructed by concatenating two token representations from the stack and one from the buffer. We pass the feature representations through a feed-forward network and train with a greedy transition-based system. The resulting parser has a very simple architecture with minimal feature engineering and achieves state-of-the-art results for Telugu. We report the parser results on the Telugu Paninian dependency treebank and the UD treebank.
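
The parser design described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes an arc-standard transition system (the abstract only says "greedy transition-based"), predicts unlabeled arcs only, uses random vectors in place of BERT token representations, and uses an untrained one-hidden-layer network. All function and variable names are hypothetical.

```python
import random

SHIFT, LEFT_ARC, RIGHT_ARC = 0, 1, 2

def features(stack, buffer, vectors, dim):
    """Concatenate vectors of the top two stack items and the first buffer item."""
    pad = [0.0] * dim  # zero vector when a position is empty
    s1 = vectors[stack[-1]] if len(stack) >= 1 else pad
    s2 = vectors[stack[-2]] if len(stack) >= 2 else pad
    b1 = vectors[buffer[0]] if buffer else pad
    return s2 + s1 + b1

def feed_forward(x, w1, w2):
    """One hidden ReLU layer followed by a linear layer giving one score per action."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def legal_actions(stack, buffer):
    """Actions allowed in the current configuration (arc-standard, assumed)."""
    actions = []
    if buffer:
        actions.append(SHIFT)
    if len(stack) >= 2:
        actions.append(RIGHT_ARC)
        if stack[-2] != 0:  # index 0 is ROOT and can never be a dependent
            actions.append(LEFT_ARC)
    return actions

def greedy_parse(vectors, w1, w2):
    """Return the head index of each token (0 = ROOT), choosing the best legal action greedily."""
    dim = len(vectors[0])
    n = len(vectors) - 1  # vectors[0] stands for ROOT
    stack, buffer = [0], list(range(1, n + 1))
    heads = [None] * (n + 1)
    while buffer or len(stack) > 1:
        scores = feed_forward(features(stack, buffer, vectors, dim), w1, w2)
        action = max(legal_actions(stack, buffer), key=lambda a: scores[a])
        if action == SHIFT:
            stack.append(buffer.pop(0))
        elif action == LEFT_ARC:
            dep = stack.pop(-2)      # second item becomes a dependent of the top
            heads[dep] = stack[-1]
        else:
            dep = stack.pop()        # top becomes a dependent of the second item
            heads[dep] = stack[-1]
    return heads[1:]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Stand-in data: random "contextual" vectors and untrained weights (assumptions).
rng = random.Random(0)
dim, hidden_size, n_tokens = 4, 8, 5
vectors = random_matrix(n_tokens + 1, dim, rng)
w1 = random_matrix(hidden_size, 3 * dim, rng)
w2 = random_matrix(3, hidden_size, rng)
heads = greedy_parse(vectors, w1, w2)
print(heads)
```

Because the weights are untrained, the printed heads are arbitrary; the point is only that the feature vector is the concatenation of three token vectors and that each step takes the highest-scoring legal transition, which always yields a well-formed tree.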

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
