IIIT Hyderabad Publications
Semantic Role Labeling for Indian Languages

Author: Aishwary Gupta
Date: 2019-09-26
Report no: IIIT/TH/2019/109
Advisor: Manish Shrivastava

Abstract

Semantic role labeling, also known as shallow semantic parsing, is a Natural Language Processing task that determines the labels of words or phrases (groups of words) in a sentence. The labels are of types such as agent, patient/receiver, goal, temporal, locative, or object. The task falls under the domain of Artificial Intelligence because we want to perform it automatically. Formally, it can be broken into three steps. The first step is to identify the verbs, or predicates, in a sentence. The second step detects the semantic arguments associated with each predicate, and the final step labels these arguments with respect to their predicates. For instance, in the sentence "Ram killed Shyam with a gun", a semantic role labeler should recognize 'killed' (representing the phrase "to kill") as a predicate, 'Ram' as the killer (agent), 'Shyam' as the recipient/receiver, and "a gun" as the theme/object. Such representations are an important step towards determining the meaning of a sentence. Semantic role labeling (SRL) is more fine-grained than syntactic parsing of a sentence: it has a larger number of classes and groups fewer clauses in each class. For example, "the car belongs to Ram" needs labels such as 'owned' and 'owner', whereas "the car was sold to Ram" needs different labels altogether, such as 'theme/goal' and 'receiver', even though the two clauses would look similar in terms of subject and object functions. Hence, the labels largely depend on the whole clause rather than just the individual phrase or chunk.

In this thesis, we begin with a detailed overview of the literature in the field of semantic role labeling.
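The three-step pipeline described above can be sketched in miniature as follows. The tiny predicate set and role lexicon here are purely illustrative assumptions, not the thesis's actual models:

```python
# Toy sketch of the three SRL steps for one sentence.
# PREDICATES and ROLE_LEXICON are hypothetical stand-ins for real
# predicate/argument models, used only to illustrate the pipeline.

PREDICATES = {"killed"}          # step 1: known predicate forms (toy)
ROLE_LEXICON = {                 # step 3: chunk -> semantic role (toy)
    "Ram": "agent",
    "Shyam": "patient",
    "a gun": "instrument",
}

def label_roles(chunks):
    # Step 1: predicate identification
    predicates = [c for c in chunks if c in PREDICATES]
    # Step 2: argument identification (here: any non-predicate chunk
    # found in the lexicon counts as an argument)
    arguments = [c for c in chunks
                 if c not in PREDICATES and c in ROLE_LEXICON]
    # Step 3: argument classification, per predicate
    return {pred: {arg: ROLE_LEXICON[arg] for arg in arguments}
            for pred in predicates}

chunks = ["Ram", "killed", "Shyam", "with", "a gun"]
print(label_roles(chunks))
# {'killed': {'Ram': 'agent', 'Shyam': 'patient', 'a gun': 'instrument'}}
```

A real system replaces both lookups with learned classifiers, but the decomposition into identification and classification stages is the same one the thesis follows.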
This includes a discussion of the various techniques used to tackle shallow semantic parsing in the past and the development of datasets built for semantic analysis, such as PropBank and FrameNet, and finally we turn to the semantic role labeling task for Indian languages. Since most Indian languages are very low-resourced, we restrict our work to Hindi and Urdu, for which the respective PropBanks are reasonably sized. We also give a detailed analysis of the Hindi/Urdu PropBank data sections used in our experiments.

Next, we present a statistical semantic role labeler for Hindi, which we extend to Urdu as well. We propose a set of new features enriching the existing baseline system for these languages. We break the system into two subsequent tasks: argument identification and argument classification. Our experiments show a reasonable improvement over the previous baseline for Hindi, mainly in the classification step. We also report significant improvements on the argument identification task for Urdu. We propose a new baseline system for Indian languages using 5-fold cross-validation and report results both excluding and including the 'null' class. Lastly, we give an error analysis of our model compared with the previous baseline and examine both the improvements and the limitations of this model.

Finally, we introduce a supervised deep learning model for the same Indian languages, namely Hindi and Urdu, which uses minimal syntax and yet improves significantly over our statistical model. We build three different models inspired by recent advances in this field. In the first model, we use sequence modeling to generate dependency path embeddings and jointly learn the classification process, i.e., both identification and labeling of arguments.
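The 5-fold cross-validation protocol mentioned above can be sketched generically as follows; this is standard splitting logic, not the thesis's exact experimental code:

```python
def five_fold_splits(items, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation; each item appears in exactly one test fold."""
    n = len(items)
    fold_size, remainder = divmod(n, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first few folds.
        size = fold_size + (1 if fold < remainder else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

sentences = list(range(10))  # stand-in for annotated sentences
for train, test in five_fold_splits(sentences):
    print(test)
```

Each fold holds out a disjoint fifth of the data for evaluation while training on the rest, and the reported score is typically averaged over the five folds.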
The second model is syntax-agnostic: it encodes the full sentence with a bi-directional LSTM encoder using only the raw words/tokens. The third and final model adds dependency labels to the previous model, making it slightly syntax-aware, and it performs very well compared to the other models. Finally, we discuss evaluation metrics and give an analysis of the three models as well as the statistical model.

Full thesis: pdf

Centre for Language Technologies Research Centre
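SRL systems are commonly evaluated with precision, recall, and F1 over labeled argument tuples; a generic sketch of such a scorer (an assumption about the metric's shape, not the thesis's exact evaluation script) looks like this:

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over labeled argument tuples,
    e.g. (predicate, argument_span, role) triples."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("killed", "Ram", "agent"), ("killed", "Shyam", "patient")}
pred = {("killed", "Ram", "agent"), ("killed", "a gun", "instrument")}
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```

An argument only counts as correct when its span and its role both match the gold annotation, which is why identification and classification errors both depress the final F1.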