Doc2Sent2Vec: A Novel Two-Phase Approach for Learning Document Representation

Authors: Ganesh J, Manish Gupta, Vasudeva Varma
Conference: the 25th International World Wide Web Conference (WWW 2016)
Location: Montreal, Canada
Date: 2016-04-11
Report no: IIIT/TR/2016/16

Abstract

Doc2Sent2Vec is an unsupervised approach to learning a low-dimensional feature vector (or embedding) for a document. This embedding captures the semantics of the document and can be fed as input to machine learning algorithms for a wide range of applications in the field of data mining and information retrieval, including document classification, retrieval, and ranking. The proposed approach has two phases. In the first phase, the model learns a vector for each sentence in the document using a standard word-level language model. In the second phase, it learns the document representation from the sentence sequence using a novel sentence-level language model. Intuitively, the first phase captures word-level coherence to learn sentence embeddings, while the second phase captures sentence-level coherence to learn document embeddings. Compared to state-of-the-art models that learn document vectors directly from word sequences, we hypothesize that the proposed decoupled strategy of learning sentence embeddings followed by document embeddings helps the model learn accurate and rich document representations. We evaluate the learned document embeddings on two classification tasks: scientific article classification and Wikipedia page classification. Our model outperforms the current state-of-the-art models on the scientific article classification task by 12.07% and on the Wikipedia page classification task by 6.93%, both in terms of F1 score. These results highlight the superior quality of the document embeddings learned by the Doc2Sent2Vec approach.
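
The two-phase data flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' model: the abstract describes trained word-level and sentence-level language models, whereas the sketch substitutes plain averaging for both phases so that only the decoupled structure (words to sentence vectors, then sentence vectors to a document vector) is visible. The function names, the embedding size `DIM`, the random word vectors, and the period-based sentence splitting are all assumptions made for illustration.

```python
import random

DIM = 8  # embedding size, chosen arbitrarily for this illustration


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def phase1_sentence_vectors(sentences, word_vecs):
    # Placeholder for the word-level language model (assumption):
    # each sentence vector is the average of its word vectors.
    return [mean_vector([word_vecs[w] for w in sent]) for sent in sentences]


def phase2_document_vector(sentence_vecs):
    # Placeholder for the sentence-level language model (assumption):
    # the document vector is the average of its sentence vectors.
    return mean_vector(sentence_vecs)


def embed_document(document, word_vecs):
    # Naive sentence splitting on periods, sufficient for the toy input.
    sentences = [s.split() for s in document.split(".") if s.strip()]
    sentence_vecs = phase1_sentence_vectors(sentences, word_vecs)
    return sentence_vecs, phase2_document_vector(sentence_vecs)


rng = random.Random(0)
doc = "embeddings capture meaning. sentences form documents. documents have meaning"
vocab = sorted(set(doc.replace(".", " ").split()))
# Random vectors stand in for learned word embeddings (assumption).
word_vecs = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

sent_vecs, doc_vec = embed_document(doc, word_vecs)
print(len(sent_vecs), len(doc_vec))  # prints: 3 8
```

The resulting `doc_vec` plays the role of the document embedding that would be passed to a downstream classifier. In the approach described above, each averaging step would instead be replaced by an embedding learned through the corresponding language model.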

Full paper: pdf

Centre for Search and Information Extraction Lab

IIIT Hyderabad Publications
