IIIT Hyderabad Publications
Universal Dependency Parsing of Hindi-English Code-switching

Author: Irshad Ahmad Bhat
Date: 2018-06-30
Report no: IIIT/TH/2018/44
Advisor: Manish Shrivastava

Abstract

Code-switching is the phenomenon of mixing grammatical structures of two or more languages under varied social constraints. Code-switching data differ so radically from the benchmark corpora used in the NLP community that applying standard technologies to them degrades performance sharply. Unlike standard corpora, these data often need additional processing such as language identification, normalization and/or back-transliteration before they can be handled effectively. In this thesis, we investigate these indispensable processes and other problems associated with the syntactic parsing of code-switching data, and propose methods to mitigate their effects. In particular, we study dependency parsing of code-switching data of Hindi and English multilingual speakers from Twitter. We present a treebank of Hindi-English code-switching tweets under the Universal Dependencies scheme and propose domain-adaptation techniques to efficiently leverage monolingual syntactic annotations together with the annotations from the Hindi-English code-switching treebank. First, we propose modifications to parsing models trained only on the monolingual Hindi and English treebanks, and show that code-switching text can be parsed efficiently by monolingual parsing models if they are manipulated intelligently; against an informed monolingual baseline, our parsing strategies are at least 10 LAS points better. Second, we propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the monolingual Hindi and English treebanks. We also present normalization and back-transliteration models with a decoding process tailored to code-switching data.
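The preprocessing steps named above (token-level language identification, then normalization/back-transliteration of Romanized Hindi) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the tiny transliteration table and the lookup-based language identifier are hypothetical stand-ins for the statistical models the thesis trains.

```python
# Hypothetical back-transliteration lexicon: Romanized Hindi -> Devanagari.
# A real system would use a trained transducer, not a lookup table.
HINDI_BACK_TRANSLIT = {"bahut": "बहुत", "acchi": "अच्छी", "yaar": "यार"}

def identify_language(token):
    # Stand-in language identifier: a real model would use character
    # n-grams and sentence context; here we just consult the lexicon.
    return "hi" if token.lower() in HINDI_BACK_TRANSLIT else "en"

def preprocess(tokens):
    """Tag each token with a language and a normalized surface form."""
    out = []
    for tok in tokens:
        lang = identify_language(tok)
        # Back-transliterate Hindi tokens; leave English tokens as-is.
        form = HINDI_BACK_TRANSLIT[tok.lower()] if lang == "hi" else tok
        out.append((tok, lang, form))
    return out

print(preprocess(["bahut", "acchi", "movie", "yaar"]))
# → [('bahut', 'hi', 'बहुत'), ('acchi', 'hi', 'अच्छी'),
#    ('movie', 'en', 'movie'), ('yaar', 'hi', 'यार')]
```

The parser then runs on the normalized forms, which is what lets monolingual Hindi and English models be applied to code-switching text.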
Our neural stacking models achieve 90.53% accuracy for POS tagging, and 80.23% UAS and 71.03% LAS for dependency parsing. Results show that our neural stacking parser is 1.5 LAS points better than the augmented parsing model, and that our decoding process improves results by 3.8 LAS points over the first-best normalization and/or back-transliteration.

Full thesis: pdf

Centre for Language Technologies Research Centre
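For readers unfamiliar with the evaluation metrics quoted above: UAS (unlabeled attachment score) is the percentage of tokens assigned the correct head, and LAS (labeled attachment score) additionally requires the correct dependency relation. A minimal sketch, assuming each token is represented as a (head index, relation label) pair; the data below is a toy example, not from the thesis:

```python
def uas_las(gold, pred):
    """Return (UAS, LAS) as percentages over aligned token lists.

    UAS counts tokens whose predicted head matches the gold head;
    LAS counts tokens where both head and dependency label match.
    """
    assert len(gold) == len(pred)
    head_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    label_hits = sum(1 for (gh, gl), (ph, pl) in zip(gold, pred)
                     if gh == ph and gl == pl)
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * label_hits / n

# Toy 4-token sentence: 3 heads are correct, and 2 of those also
# carry the correct label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
print(uas_las(gold, pred))  # → (75.0, 50.0)
```

Since LAS imposes the stricter condition, it is always less than or equal to UAS, which matches the 80.23% UAS vs. 71.03% LAS figures reported above.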
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.