Exploiting Textual Content in Academic Citation Networks

Author: Soumyajit Ganguly
Date: 2019-02-14
Report no: IIIT/TH/2019/7
Advisor: Vikram Pudi

Abstract

An academic citation network can be viewed as an information graph in which individual nodes contain rich textual information. With the current trend toward open access to scientific literature, we presume that the full text of a scientific article (or paper) is a vital source of information that aids various recommendation and prediction tasks in this domain, since it describes the corresponding node in the citation graph in detail. The first task we focus on is mining competing algorithms from research papers. As an example, consider the association rule mining problem in data mining or the object recognition problem in computer vision. These problems have been studied extensively, and many authors and research groups have proposed their own novel solutions. Authors typically measure their proposed algorithms against the current state of the art. Given the textual content of a paper, we propose to mine all the algorithms that the paper compares itself against. As our first contribution, we develop an unsupervised mechanism that leverages natural language processing techniques to solve this problem. Recently, representation learning using neural networks has received much attention from both industrial and academic communities. The advent of powerful GPUs and techniques such as dropout and rectified linear units has revived interest in neural networks. Current state-of-the-art solutions for applications in Natural Language Processing (NLP) and Information Retrieval (IR) are powered by neural network based representation learning models. We explore a similar theme, using the text content of research papers, and propose ways of improving search, retrieval and recommendation in academic citation networks.
Citation links alone can miss related work: for example, two papers published within a short time span would not necessarily cite each other even if they address very similar ideas. As our second contribution, we introduce Paper2vec: a novel two-step neural network based approach to obtain vector representations of scientific papers from an academic citation network by exploiting both their textual content and network (graph) properties. Our third contribution is along similar lines, but instead of papers we focus on the authors in the network. We specifically explore how the content of the papers written by an author can aid collaboration prediction, and propose Author2vec: a novel joint training approach using neural networks for this task. With the growing body of open-access research papers and journals, the set of algorithms and frameworks proposed in this thesis could act as a first step towards using full textual content to aid various prediction and recommendation tasks in academic citation networks.

Full thesis: pdf

Centre for Data Engineering

IIIT Hyderabad Publications
