IIIT Hyderabad Publications
Text Embeddings in Riemannian Manifolds

Author: Souvik Bannerjee
Date: 2023-07-07
Report no: IIIT/TH/2023/116
Advisor: Manish Shrivastava

Abstract

Unsupervised text embedding models are ubiquitous in Natural Language Processing. These models embed words, sentences, or documents as vectors in a Euclidean space on the principle that semantically similar texts have similar representations, i.e., they lie close to each other in the semantic space. Word2vec and GloVe are the most popular examples of such models. Both provide efficient training of word embeddings, and evaluations on word similarity and word analogy tasks demonstrate their effectiveness. However, they have their fair share of drawbacks. For instance, there is little to no explanation of why word vector summation solves word analogy. These models also suffer from meaning conflation deficiency, where the multiple senses of a word are represented by a single vector, leading to inaccurate semantic modelling. This is also an issue for document and sentence embeddings, which span multiple words, phrases, and topics. Existing unsupervised models that tackle meaning conflation deficiency perform sense representation, where each sense of a word is represented by an individual vector; this is done by modifying the Word2vec skip-gram algorithm and adding extra constraints on the vector embedding space. Instead of adding extra constraints to the Word2vec algorithm, we address the issue by exploiting the geometry of the embedding space. We propose three unsupervised text embedding models that embed texts in Riemannian manifolds by integrating the linguistic principles of Word2vec and GloVe with optimization tools from differential geometry. The first and second models use the same joint modelling framework but embed words and documents in the Grassmannian manifold and in a custom product manifold, respectively. Both models generate quality document embeddings, as shown by their evaluation results on document clustering and document classification tasks. The third model embeds words in the spectrahedron manifold, where each word is a matrix whose eigenvectors correspond to the multiple senses of that word. Finally, we provide a Lie-group-theoretic account of the linear substructures in Word2vec and GloVe that solve word analogy. Lie groups are differentiable manifolds with a group structure; some basic properties of these groups show that the linear substructures arise from the tangent space at the identity element of the group.
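As a concrete illustration of the word analogy task mentioned in the abstract, the following sketch (illustrative only, not from the thesis; "vectors" is a hypothetical dict mapping words to NumPy arrays) answers a query such as man : king :: woman : ? by nearest-neighbour search around v(king) - v(man) + v(woman):

    import numpy as np

    def analogy(vectors, a, b, c):
        # Find the word whose vector is closest (by cosine similarity)
        # to vectors[b] - vectors[a] + vectors[c].
        target = vectors[b] - vectors[a] + vectors[c]
        best_word, best_sim = None, -np.inf
        for word, v in vectors.items():
            if word in (a, b, c):
                continue  # exclude the query words themselves
            sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target) + 1e-12)
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # e.g. analogy(vectors, 'man', 'king', 'woman') is expected to return 'queen'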
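The "optimization tools from differential geometry" can likewise be sketched. Assuming the usual representation of a point on the Grassmannian Gr(n, p) as an n x p matrix with orthonormal columns, one Riemannian gradient-descent step projects the Euclidean gradient onto the tangent space and retracts back onto the manifold via a QR decomposition (a generic textbook step, not the thesis's training procedure):

    import numpy as np

    def grassmann_step(X, egrad, step=0.1):
        # X: n x p matrix with orthonormal columns, a point on Gr(n, p).
        # egrad: Euclidean gradient of the loss at X.
        # Project the Euclidean gradient onto the tangent space at X.
        rgrad = egrad - X @ (X.T @ egrad)
        # Take a descent step, then retract onto the manifold via QR.
        Q, R = np.linalg.qr(X - step * rgrad)
        # Normalize column signs so the retraction is well defined.
        return Q * np.sign(np.sign(np.diag(R)) + 0.5)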
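For the third model, assuming the spectrahedron is the one common in manifold optimization (positive semidefinite matrices with unit trace), the multiple senses of a word can be read off its matrix with an eigendecomposition. The sketch below is hypothetical, with eigenvalues acting as sense weights:

    import numpy as np

    def word_senses(X, tol=1e-3):
        # X: a word's positive semidefinite, unit-trace matrix
        # (a point on the spectrahedron).
        # Returns (weight, direction) pairs: eigenvalues as sense weights,
        # eigenvectors as candidate sense directions.
        w, V = np.linalg.eigh(X)         # eigenvalues in ascending order
        order = np.argsort(w)[::-1]      # strongest senses first
        return [(w[i], V[:, i]) for i in order if w[i] > tol]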
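The Lie-group claim at the end of the abstract can be made concrete with two standard facts (a sketch, not the thesis's derivation): the tangent space at the identity of a Lie group G is its Lie algebra, and the exponential map turns addition of commuting tangent vectors into group composition,

\[
\mathfrak{g} = T_e G, \qquad \exp \colon \mathfrak{g} \to G, \qquad
\exp(u)\exp(v) = \exp(u + v) \quad \text{whenever } [u, v] = 0,
\]

so linear operations on tangent vectors at the identity, such as the vector arithmetic behind analogies, correspond to composition of group elements.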
Full thesis: pdf

Language Technologies Research Centre