IIIT Hyderabad Publications
Discovery and Interpretation of Embedding Models for Knowledge Representation

Author: Ganesh J
Date: 2017-06-17
Report no: IIIT/TH/2017/38
Advisors: Vasudeva Varma, Manish Gupta

Abstract

Recently, representation learning has been receiving much attention from both the industrial and academic communities. Thanks to the advent of powerful GPUs and advanced techniques such as dropout and rectified linear units, interest in neural networks has been revived. The current state-of-the-art solutions for applications in the fields of Natural Language Processing (NLP) and Information Retrieval (IR) are powered by neural-network-based representation learning models. In this thesis, we focus on unsupervised representation learning models, which are cheap to build (as they rely on unlabeled data) yet very effective for many downstream applications. We explore two main challenges in building an unsupervised representation learning model for NLP and IR problems.

Context serves as the source of knowledge for estimating the representation of data. For example, in Word2Vec, the context of a word is the set of words surrounding it in a sentence. The first challenge we focus on is the context insufficiency problem. For instance, consider the models used in practice to generate representations for tweets. We observe that tweets do not exist in isolation, and hence the performance of models that work only with the content of the tweet (as context) is found to be sub-optimal. To handle this issue, we propose a better model which also captures the interactions between a tweet and its adjacent tweets (as context) in the user's timeline.
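As a rough illustration of the notion of context mentioned above (a sketch for this page, not code from the thesis): in the skip-gram formulation of Word2Vec, each word's context is the set of words within a fixed window around it, and training pairs are extracted as follows.

```python
def context_pairs(tokens, window=2):
    """Extract (target, context_word) training pairs as in the
    skip-gram formulation of Word2Vec: the context of each word
    is the set of words within `window` positions of it."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs
```

For the sentence "the quick brown fox" with a window of 2, the word "the" is paired with "quick" and "brown" but not "fox". The tweet model described above can be seen as widening this context beyond the sentence, to the adjacent tweets in the timeline.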
Along similar lines, we explore two more use cases where we incorporate novel contexts that capture complex interactions: the interaction between a scientific author and his/her papers in an author collaboration network ('Author2Vec'), and the interaction between sentences in a document ('Doc2Sent2Vec'). These contexts turn out to be advantageous for computing accurate author and document representations, respectively. We conclude that if we leverage the available contexts wisely, the performance of the model improves significantly.

Though representation learning models perform well in practice, little is known about the core properties of the data encoded within the representations. Understanding these core properties would empower us to draw generalizable conclusions about the quality of the representations. Hence, the second challenge we focus on is the human interpretability of these automatically learned representations. For instance, researchers in Twitter analytics obtain interesting results by applying different representation learning models to several valuable tasks such as sentiment analysis, semantic textual similarity computation, microblog retrieval, and hashtag identification. In order to understand the core properties encoded in a tweet representation, we evaluate the representations to estimate the extent to which they can model each of several properties, such as tweet length, presence of words, hashtags, mentions, capitalization, and so on. This is done with the help of multiple classifiers which take the representation as input: each classifier evaluates one of the syntactic or social properties which are arguably salient for a tweet. The result is an application-independent, fine-grained analysis of the tweet representations generated by different representation learning models.
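The probing idea above can be sketched minimally (an illustrative stand-in, not the thesis's actual probes): train one simple classifier per property on the representations, and read its held-out accuracy as a measure of how strongly the property is encoded. Here a nearest-centroid classifier plays the role of the probe; the representations and labels are synthetic.

```python
import math

def centroid_probe(train_reps, train_labels, test_reps, test_labels):
    """Probe whether a (binary or multi-class) property is recoverable
    from fixed representations: fit one centroid per label on the
    training split, classify the test split by nearest centroid,
    and return accuracy. Accuracy above chance suggests the
    representation encodes the property."""
    centroids, counts = {}, {}
    for rep, lab in zip(train_reps, train_labels):
        c = centroids.setdefault(lab, [0.0] * len(rep))
        for k, v in enumerate(rep):
            c[k] += v
        counts[lab] = counts.get(lab, 0) + 1
    for lab in centroids:
        centroids[lab] = [v / counts[lab] for v in centroids[lab]]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    correct = 0
    for rep, lab in zip(test_reps, test_labels):
        pred = min(centroids, key=lambda l: dist(rep, centroids[l]))
        correct += int(pred == lab)
    return correct / len(test_labels)
```

Running one such probe per property (length, hashtag presence, mentions, and so on) yields the kind of fine-grained, application-independent profile of a representation described above.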
This thesis is one of the initial works to overcome the above-mentioned challenges by proposing novel methods to improve the contexts and interpret the representations. In the growing body of representation learning research, the models and frameworks proposed in this thesis could act as basic building blocks for future work attempting to advance the science of building smarter NLP/IR systems.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.