IIIT Hyderabad Publications
Towards Semantic Role Labelling of Hindi-English Code-Mixed Data

Author: Riya Pal
Date: 2020-04-28
Report no: IIIT/TH/2020/27
Advisor: Dipti Misra Sharma

Abstract

Semantic Role Labelling (SRL) is a sub-task of Natural Language Understanding. It helps us understand the function of a particular token with respect to its associated verb in a sentence: SRL identifies the event (verb) of the sentence and the associated participants. SRL is used as an intermediate step in various tasks such as information extraction, question answering systems and machine translation, to name a few.

In the past decade, social media has become increasingly popular. As a result, there is a huge amount of data available on forums like Facebook and Twitter which cannot be processed or analysed by traditional Natural Language Processing (NLP) tools. This data usually does not follow grammatical rules, is spelt inconsistently and, for multilingual users, often contains a mixture of languages. In this thesis, we discuss our effort towards building a semantic role labeller for such mixed-language ('Hinglish') data, i.e. Hindi-English code-mixed data. To the best of our knowledge, this is the first attempt at creating an SRL tool for Hindi-English code-mixed data, or indeed for any code-mixed data.

First, we created a dataset annotated with semantic roles. We released this dataset of 1,460 Hindi-English code-mixed tweets, consisting of 20,949 tokens, parsed and annotated with semantic roles (Proposition Bank labels). We also created the required frame files to take into account mixed-language complex predicate formations and certain special constructions which are unique to code-mixed data. We then built a rule-based baseline model for automated labelling of this data. Research shows that there is a strong relation between dependency labels and semantic roles, and we leverage this information to build the rules for our model.
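As a rough illustration of such a rule-based baseline, the sketch below maps dependency labels to Proposition Bank roles in the two steps the abstract describes: first identify a verb's arguments from the dependency parse, then classify each one. The label inventory, the mapping and the token format are simplified assumptions for illustration, not the thesis's actual rule set.

```python
# Illustrative sketch (assumed, not the thesis's actual rules): a two-step
# rule-based SRL baseline driven by Paninian-style dependency labels.

# Assumed mapping from dependency labels to PropBank roles.
DEP_TO_ROLE = {
    "k1": "ARG0",       # karta: agent-like participant
    "k2": "ARG1",       # karma: theme/patient-like participant
    "k4": "ARG2",       # sampradaan: recipient/beneficiary
    "k7t": "ARGM-TMP",  # time adjunct
    "k7p": "ARGM-LOC",  # place adjunct
}

def identify_arguments(tokens, verb_index):
    """Step 1 (argument identification): keep tokens that attach
    directly to the verb in the dependency tree."""
    return [i for i, t in enumerate(tokens) if t["head"] == verb_index]

def classify_arguments(tokens, arg_indices):
    """Step 2 (argument classification): map each identified argument's
    dependency label to a semantic role, where a rule exists."""
    return [(tokens[i]["word"], DEP_TO_ROLE[tokens[i]["dep"]])
            for i in arg_indices if tokens[i]["dep"] in DEP_TO_ROLE]

# Toy code-mixed sentence: "Ram ne khana khaya yesterday"
tokens = [
    {"word": "Ram",       "dep": "k1",      "head": 4},
    {"word": "ne",        "dep": "lwg_psp", "head": 0},
    {"word": "khana",     "dep": "k2",      "head": 4},
    {"word": "yesterday", "dep": "k7t",     "head": 4},
    {"word": "khaya",     "dep": "root",    "head": -1},
]
args = identify_arguments(tokens, 4)
print(classify_arguments(tokens, args))
# [('Ram', 'ARG0'), ('khana', 'ARG1'), ('yesterday', 'ARGM-TMP')]
```

Errors propagate between the two steps: an argument missed in step 1 can never receive a role in step 2, which is one reason the two stages are usually evaluated separately.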
This model can further reduce annotation effort in the future. The task is carried out in two steps: the first is to identify the participants (arguments) of the event (verb) in the sentence; the second is to classify these identified arguments according to their function, i.e. mark their semantic roles. We then carried out a thorough error analysis.

Further, we worked on improving the accuracy scores obtained by our baseline model. For this, we explored statistical methods previously used for SRL in English and Indian languages. We first describe our effort towards pre-processing and normalising the data to ensure consistency and uniformity of tokens across the corpus. We then formulated and extracted various features which capture semantic information from our data: a combination of features traditionally used for SRL, features used in Hindi SRL, and features specific to the code-mixed nature of our data. We experimented with and compared the performance of our model using Paninian dependency labels as well as Universal Dependency (UD) labels. We also trained our model on a Hindi monolingual corpus and tested it on code-mixed data. Finally, we compared and analysed the performance of our models across these experiments.

Full thesis: pdf

Centre for Language Technologies Research Centre