IIIT Hyderabad Publications
Title: Exploring Event Extraction Across Languages
Author: Suhan Prabhu
Date: 2020-08-18
Report no: IIIT/TH/2020/71
Advisor: Manish Shrivastava

Abstract

This thesis is a culmination of the work done on event detection, annotation and analysis. We present the development of large-scale event detection for low-resource languages from two perspectives: linguistic and computational. From the linguistic perspective, we discuss the creation of a language-specific event annotation and representation task for Kannada, a morphologically rich, resource-poor Dravidian language, and for Hindi, a popular Indo-Aryan language. From a computational perspective, we look into leveraging information from resource-rich languages and using transfer learning to detect events in a resource-poor environment as well. We analyze events in Hindi briefly, and in Kannada in depth. As most information retrieval and extraction tasks are resource intensive, very little work has been done on Kannada NLP, with almost no efforts in discourse analysis or in dataset creation for representing events and other semantic annotations in text. In this thesis, we linguistically analyze what constitutes an event in this language, along with the challenges that discourse-level annotation and representation face due to the language's rich derivational morphology, which allows free word order, numerous multi-word expressions, adverbial participle constructions and constraints on subject-verb relations. This is therefore one of the first attempts at large-scale discourse-level annotation for Kannada, which can be used for semantic annotation and corpus development for other tasks in the language. From a processing viewpoint, on the other hand, detection of TimeML events in text has traditionally been done on corpora such as TimeBanks, with traditional architectures revolving around highly feature-engineered, language-specific statistical models.
In this thesis, we also present a Language Invariant Neural Event Detection (ALINED) architecture. ALINED aggregates sub-word-level features with lexical and structural information by combining convolution over character embeddings with recurrent layers over contextual word embeddings. We find that our model extracts relevant features for event span identification without relying on language-specific features. We compare the performance of our language-invariant model to the current state of the art in English, Spanish, Italian and French. We outperform the state-of-the-art F1-score in English by 1.65 points, and achieve F1-scores of 84.96, 80.87 and 74.81 on Spanish, Italian and French respectively, which is comparable to the current state of the art for these languages. We also introduce the automatic annotation of events in Hindi and Kannada, with F1-scores of 77.13 and 67.30 respectively.

Centre for Language Technologies Research Centre
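The abstract describes ALINED as a convolution over character embeddings combined with recurrent layers over contextual word embeddings, classifying each token as inside or outside an event span. The sketch below illustrates that general shape in PyTorch; all layer sizes, names and the binary tag set are illustrative assumptions, not details taken from the thesis.

```python
import torch
import torch.nn as nn

class ALINEDSketch(nn.Module):
    """Illustrative sketch of the ALINED idea (hyperparameters assumed):
    a char-level CNN feature per token, concatenated with a pre-computed
    contextual word embedding, fed to a BiLSTM and a per-token classifier."""

    def __init__(self, n_chars=100, char_dim=30, char_filters=50,
                 word_dim=300, hidden=128, n_tags=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # 1-D convolution over each token's character sequence
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_vecs, char_ids):
        # word_vecs: (batch, seq, word_dim) contextual word embeddings
        # char_ids:  (batch, seq, max_chars) character indices per token
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c))   # (b*s, c, char_dim)
        chars = self.char_cnn(chars.transpose(1, 2))     # (b*s, filters, c)
        chars = chars.max(dim=2).values.view(b, s, -1)   # max-pool over characters
        feats, _ = self.lstm(torch.cat([word_vecs, chars], dim=-1))
        return self.out(feats)                           # (batch, seq, n_tags) logits

model = ALINEDSketch()
# 2 sentences, 7 tokens each, up to 12 characters per token
logits = model(torch.randn(2, 7, 300), torch.randint(1, 100, (2, 7, 12)))
print(tuple(logits.shape))  # (2, 7, 2): one event/non-event score pair per token
```

Because the character CNN needs no language-specific features, the same network can in principle be applied to any language for which character and contextual word embeddings are available, which matches the language-invariance claim in the abstract.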
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.