IIIT Hyderabad Publications
The Hindi TimeBank
Author: Pranav Goel
Date: 2020-06-25
Report no: IIIT/TH/2020/55
Advisor: Manish Shrivastava
Abstract
The processing of temporal information is an important facet of discourse comprehension, and has been studied in both linguistic philosophy and natural language processing. Major developments in this field include event identification and classification guidelines such as ISO-TimeML, and datasets based on those guidelines, known as TimeBanks. Language-specific TimeBanks not only provide insight into the event representation mechanism of a language but also provide a corpus for deep learning tasks in NLP based on events and time information. In this thesis, we establish the Hin-TimeBank, an ISO-TimeML reference corpus with guidelines modified to handle both language-specific and language-independent additions to the ISO-TimeML schema. We present a set of annotation guidelines and a specification designed for the development of an extensible Hindi TimeBank, a manually annotated 1,000-article corpus marked with events and states, their categories, and temporal expressions. We also highlight the modifications to ISO-TimeML required for the identification and classification of events and states in Hindi. Unlike ISO-TimeML, we introduce states as a distinct concept and provide categories of events and states based on these modified guidelines. The modifications to the ISO-TimeML schema and the associated TimeML event annotation guidelines stem from a Paninian perspective on event semantics and the pertinent syntax, a perspective that has not previously been explored. The reliability of the guidelines and specification is established by the high inter-annotator agreement. Furthermore, we describe the development of a knowledge graph from an event-annotated corpus by presenting a pipeline that identifies and extracts the relations between entities and events from Hindi news articles.
Due to the semantic implications of argument identification for events in Hindi, we use a combined syntactic argument and semantic role identification methodology. To the best of our knowledge, no other architecture exists for this purpose. The extracted combined role information is incorporated in a knowledge graph that can be queried via sub-graph extraction for basic questions. The architectures presented in this thesis can be used for participant extraction and event-entity linking in most Indo-Aryan languages, due to similar syntactic and semantic properties of event arguments.
Full thesis: pdf
Centre: Language Technologies Research Centre
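As a rough illustration of the querying idea mentioned in the abstract (a knowledge graph of events and their participants, answered via sub-graph extraction), the following is a minimal sketch and not the thesis's implementation. The class `EventGraph`, its methods, the role names (`agent`, `location`, `time`), and the sample event are all hypothetical and chosen only for illustration; the actual role inventory and graph schema are defined in the thesis.

```python
from collections import defaultdict


class EventGraph:
    """Toy event-entity graph; names and roles here are illustrative assumptions."""

    def __init__(self):
        # Maps each node to a list of (role, neighbour) pairs.
        self.edges = defaultdict(list)

    def add_relation(self, event, role, entity):
        # Store the link in both directions so either node can be queried.
        self.edges[event].append((role, entity))
        self.edges[entity].append((role, event))

    def subgraph(self, node):
        # Extract the one-hop sub-graph around a node.
        return {node: list(self.edges[node])}

    def answer(self, event, role):
        # Answer a basic question by filtering the event's sub-graph on a role.
        return [n for r, n in self.subgraph(event)[event] if r == role]


graph = EventGraph()
# Hypothetical relations, as if extracted from one annotated news sentence.
graph.add_relation("e1:visit", "agent", "Prime Minister")
graph.add_relation("e1:visit", "location", "Hyderabad")
graph.add_relation("e1:visit", "time", "t1:Monday")

who = graph.answer("e1:visit", "agent")       # "who visited?"
where = graph.answer("e1:visit", "location")  # "where was the visit?"
print(who, where)
```

In this sketch a "basic question" reduces to selecting the edges of one event's sub-graph by role; a real pipeline would first have to identify the event and its arguments in the Hindi text.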