Unity in Diversity: A unified parsing strategy for major Indian languages

Authors: Juhi Tandon,Dipti Misra Sharma
Conference: International Conference on Dependency Linguistics (Depling 2017)
Location: Pisa, Italy
Date: 2017-09-18
Report no: IIIT/TR/2017/118

Abstract

This paper presents our work on applying a non-linear neural network to parse five resource-poor Indian languages belonging to two major language families, Indo-Aryan and Dravidian. Bengali and Marathi are Indo-Aryan languages, whereas Kannada, Telugu and Malayalam belong to the Dravidian family. While only a little work has previously been done on linear transition-based parsing of Bengali and Telugu, we present one of the first parsers for Marathi, Kannada and Malayalam. All of these Indian languages have free word order and range from moderately to very rich in morphology. Therefore, in this work we propose the use of linguistically motivated morphological features (suffix and postposition) in the non-linear framework to capture the intricacies of both language families. We also capture chunk information and gender, number and person information in this model. We put forward ways to represent these features cost-effectively using monolingual distributed embeddings. Instead of relying on expensive morphological analyzers to extract the information, these embeddings are used to increase parsing accuracies for resource-poor languages. Our experiments provide a comparison between the two language families on the importance of varying morphological features. Part-of-speech taggers and chunkers for all languages are also built in the process.

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
