Capturing and Resolving Entities and their Mentions in Discourse

Author: vandan.mujadia
Date: 2017-03-02
Report no: IIIT/TH/2017/16
Advisor: Dipti Misra Sharma

Abstract

In natural language, two senses of reference are generally described. The first is the symbolic relationship that expressions (words or phrases) have with concrete or abstract real-world objects. The second is the relationship between two textual expressions in a text, in which one expression provides the information needed to interpret or relate the other. These expressions are distributed across the text, and their connections form coreferential chains (or links). For a given text, coreference resolution is the task of identifying and linking expressions that refer to the same object. These coreferring expressions can be anaphors, nominal, verbal or verb-nominal expressions. Coreference links (chains) are useful in various NLP applications such as question answering, text summarization and machine translation. For English and several other European languages, coreference resolution has been studied extensively and various techniques have been proposed, but for Indian languages the work has been very limited. In this thesis, we describe our work on the second sense of reference by proposing a coreference representation scheme and different coreference relation types between contiguous mentions of the same coreference chain, such as identity, near-identity and weak-identity relations and their sub-types. We then propose and describe methods to resolve coreference for Hindi news and dialogue text. This thesis makes six major contributions on the topic of coreference. First, we identify Indian-language-specific conceptual, structural and representational issues in existing coreference annotation schemes and try to resolve them by proposing a unified coreference annotation framework and procedure. This framework covers various aspects of coreference, such as expression span, coreference chain and the relation between contiguous expressions of the same coreference chain.
Second, based on the proposed annotation scheme, we develop a semi-automatic annotation tool (CAT, the Coreference Annotation Tool) to ease the annotation process. Third, we propose a method for calculating inter-annotator agreement on various aspects and levels of coreference annotation. Fourth, using the proposed coreference annotation scheme and CAT, we annotate coreference information on parts of the Hindi and Urdu Dependency Treebanks. Fifth, we present a hybrid multi-classifier based approach to identify the reference type of an anaphor (pronoun), and we describe a hybrid (learning and rule based) approach to resolve entity-referring and event-referring pronouns in Hindi dialogue and news text. In these approaches, we explore the use of dependency structures as a source of syntactic information for entity anaphora resolution. We also compare dependency structure based rules (features) with syntax based rules (features) for event anaphora resolution. Beyond dependency based features, we explore other linguistic information, such as sub-topic boundaries, animacy and Named Entity categories, for dialogue anaphora resolution. Sixth, we present a sieve based approach for Hindi nominal reference resolution and for identifying the relation type between contiguous expressions of the same coreference chain. In this approach, we explore the use of Paninian dependency grammar, various linguistic rules based on gender, number, person and animacy, DBpedia based dictionaries, and word embeddings from word2vec and GloVe as features and rules to resolve nominal coreference. A hybrid system that applies these rules and features through a series of sieves (in a predefined order of preference) achieves reasonable accuracy for nominal reference resolution and relation type identification.
In conclusion, we combine all of the above modules (entity anaphora resolution, event anaphora resolution and nominal reference resolution) into one software kit, so that the NLP community can use it and give us feedback on the presented approaches.
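
To illustrate the general idea of a sieve based resolver mentioned above, the following is a minimal sketch in Python. It is not the thesis implementation: the `Mention` fields, the two sieves (`exact_match`, `head_match`), the `agrees` check and the English example mentions are all assumptions chosen for illustration. Sieves are applied in a predefined order, and each still-unlinked mention is attached to its nearest preceding mention that the current sieve accepts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Mention:
    """Hypothetical mention record; the field names are illustrative only."""
    text: str
    head: str
    gender: Optional[str] = None
    number: Optional[str] = None


Sieve = Callable[[Mention, Mention], bool]


def agrees(m: Mention, ante: Mention) -> bool:
    """Gender and number must not conflict; unknown (None) values are compatible."""
    for a, b in ((m.gender, ante.gender), (m.number, ante.number)):
        if a is not None and b is not None and a != b:
            return False
    return True


def exact_match(m: Mention, ante: Mention) -> bool:
    """First sieve: identical surface strings, ignoring case."""
    return m.text.lower() == ante.text.lower()


def head_match(m: Mention, ante: Mention) -> bool:
    """Second sieve: same head word plus gender/number agreement."""
    return m.head.lower() == ante.head.lower() and agrees(m, ante)


def resolve(mentions: List[Mention], sieves: List[Sieve]) -> List[int]:
    """Return one chain id per mention (the index of the chain's first mention)."""
    chain = list(range(len(mentions)))  # every mention starts as its own chain
    for sieve in sieves:  # sieves run in their predefined order of preference
        for i, m in enumerate(mentions):
            if chain[i] != i:
                continue  # this mention already has an antecedent
            for j in range(i - 1, -1, -1):  # nearest antecedent first
                if sieve(m, mentions[j]):
                    old, new = chain[i], chain[j]
                    # Move every member of the old chain into the antecedent's chain.
                    for k in range(len(chain)):
                        if chain[k] == old:
                            chain[k] = new
                    break
    return chain


# Hypothetical example mentions in document order.
demo = [
    Mention("Narendra Modi", "Modi", "m", "sg"),
    Mention("the Prime Minister", "Minister", "m", "sg"),
    Mention("Modi", "Modi", "m", "sg"),
    Mention("The Prime Minister", "Minister", "m", "sg"),
]
chains = resolve(demo, [exact_match, head_match])
```

In this sketch, mentions that share a chain id belong to the same coreference chain. A fuller system along the lines described in the abstract would add further sieves (for example, dictionary or embedding based ones) and a separate step for relation type identification.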

Full thesis: pdf

Centre: Language Technologies Research Centre

IIIT Hyderabad Publications
