Semi-automated annotated treebank construction for Hindi and Urdu

Authors: Jayendra Rakesh Yeka, Ramagurumurthy Vishnu, Dipti Misra Sharma
Conference: LREC-2014: 2nd Workshop on Indian Language Data: Resources and Evaluation (WILD RE) (LREC-2014: WILD RE Workshop 2014)

Date: 2014-05-27
Report no: IIIT/TR/2014/63

Abstract

In this paper, we describe the structure and paradigms chosen for creating annotated corpora for Hindi and Urdu. We briefly describe the Shakti Standard Format, which was chosen to suit the needs of Indian language dependency annotation. The paper aims to present a framework for creating an annotated corpus. We then discuss the automation methods chosen to reduce the laborious and time-consuming process of corpus annotation, and the methods chosen to handle the errors and multiple analyses that arise during annotation. We also present the methods, both manual and automated, used to ensure the quality of the treebank. Finally, we report the current status of the annotated corpora.

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
