IIIT Hyderabad Publications
Exploiting the Properties of Word Embeddings to Improve the Representations of Word Sequences

Author: Narendra Babu Unnam
Date: 2023-07-17
Report no: IIIT/TH/2023/128
Advisor: P Krishna Reddy

Abstract

In the current digital era, about 80% of the digital data being generated is unstructured, unlabeled natural language text. Ongoing research aims to improve text mining techniques that automatically organize, analyze, and extract useful information from this voluminous text data. In the development cycle of information retrieval and text mining applications, text representation is the most fundamental and critical step, as its effectiveness directly impacts the performance of the application. Three important properties that make text representation models suitable for practical applications are representational power, interpretability of features, and unsupervised learnability. Words and word sequences (such as sentences and documents) are the natural units of text that are subjected to vector representation. With the advent of deep learning, a family of neural language models emerged that represent words as low-dimensional, distributed, dense vectors, popularized as word embeddings. Because word embeddings are learned from huge corpora, they encode a great deal of information and have high expressive power. In the case of word sequences, traditional methods provide interpretability and support unsupervised learning, but they suffer from low representational power and poor scalability. Deep learning based methods succeed in producing vectors with high representational power; however, most of them are notorious for being uninterpretable to humans and work only in supervised learning settings. In this thesis, we propose improved word sequence representation models that exploit the frequency distributional and spatial distributional properties of word embeddings.

Firstly, we propose an alternative word sequence representation framework for longer texts such as documents. Existing vector averaging based models represent a document as a single position vector in the word embedding space. As a result, they are unable to capture the multiple aspects and the broad context of the document. Due to their low representational power, these approaches also perform poorly at document classification, and the document vectors they produce have uninterpretable features. We propose an improved document representation framework that captures multiple aspects of the document with interpretable features. Instead of representing a document in the word embedding space, the framework represents it in a distinct feature space where each dimension is associated with a potential feature word that has relatively high discriminatory power. A given document is modeled as the distances between the feature words and the document. We propose two criteria for selecting the potential feature words and a distance function to measure the distance between a feature word and a document. Experimental results on multiple datasets show that the proposed model consistently outperforms the baseline methods at document classification.
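To make the feature-space idea concrete, the following is a minimal sketch, assuming pre-trained word embeddings and an already-chosen list of feature words; the minimum cosine distance used here as the document-to-feature-word distance, and the names `document_vector` and `feature_words`, are illustrative assumptions, not the exact criteria or distance function proposed in the thesis.

```python
import numpy as np

# Hypothetical inputs: `embeddings` maps words to pre-trained vectors (numpy arrays),
# `feature_words` is a list of words assumed to have high discriminatory power
# (the thesis proposes its own selection criteria; here they are taken as given).

def cosine_distance(u, v):
    """Cosine distance between two vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def document_vector(doc_tokens, feature_words, embeddings):
    """Represent a document as its distances to each feature word.

    Each dimension corresponds to one feature word; the value here is the
    minimum cosine distance between that feature word and any word in the
    document (an illustrative choice of distance function).
    """
    doc_vectors = [embeddings[w] for w in doc_tokens if w in embeddings]
    if not doc_vectors:
        return np.zeros(len(feature_words))
    rep = np.empty(len(feature_words))
    for i, fw in enumerate(feature_words):
        fv = embeddings[fw]
        rep[i] = min(cosine_distance(fv, wv) for wv in doc_vectors)
    return rep
```

The resulting vector lives in an interpretable feature space: a small value in dimension i means the document contains words close to feature word i in the embedding space, so each coordinate can be read as the document's affinity to a concrete word.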
Secondly, we propose a weighted averaging based word sequence embedding method for shorter texts such as sentences. The proposed weighting scheme captures the contextual diversity of words based on their geometry in the embedding space. Simple unweighted vector averaging ignores the discriminative power of words and treats all the words in a sentence as equal, whereas assigning weights to words based on their discriminative power helps build better word sequence representations. Recent literature introduced word weighting schemes based on the word frequency distribution into the simple averaging model, and these frequency-based weighted averaging models, augmented with denoising steps, have been shown to outperform many complex deep learning models. However, frequency-based weighting schemes derive the word weights solely from raw counts and ignore the diversity of contexts in which the words occur. The proposed weighting algorithm is simple, unsupervised, and non-parametric. Experimental results on semantic textual similarity tasks show that the proposed weighting method outperforms all the baseline models by significant margins and performs competitively with the current frequency-based state-of-the-art weighting approaches. Furthermore, since the frequency distribution based approaches and the proposed embedding geometry based approach capture two different properties of words, we define hybrid weighting schemes that combine both varieties. We also empirically demonstrate that the hybrid weighting methods perform consistently better than the corresponding individual weighting schemes.

Overall, we have proposed a new word sequence representation framework and a weighting scheme that exploit the geometrical properties of word embeddings. Combined with existing frequency based approaches, the proposed spatial geometry based framework shows potential for better representation of word sequences, which will improve the performance of text mining applications in diverse domains.
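A minimal sketch of such a hybrid weighted averaging scheme is given below. The frequency-based weight follows the familiar a / (a + p(w)) form; the geometry-based weight is only an illustrative proxy for contextual diversity based on neighborhood density in the embedding space, and the product-based combination, along with names such as `geometry_weight` and `vocab_matrix`, are assumptions rather than the exact formulation in the thesis.

```python
import numpy as np

def sif_weight(word, word_freq, a=1e-3):
    """Frequency-based weight of the form a / (a + p(w)), where p(w) is the
    word's relative frequency in the corpus."""
    p = word_freq.get(word, 0.0)
    return a / (a + p)

def geometry_weight(word, embeddings, vocab_matrix, k=10):
    """Illustrative geometry-based weight (an assumption, not the thesis formula):
    words whose embeddings sit in densely populated regions (many close
    neighbours in `vocab_matrix`, the stacked vocabulary embeddings) get lower
    weight, as a rough proxy for low contextual diversity."""
    v = embeddings[word]
    sims = vocab_matrix @ v / (np.linalg.norm(vocab_matrix, axis=1) * np.linalg.norm(v) + 1e-12)
    top_k = np.sort(sims)[-k:]
    return 1.0 - float(np.mean(top_k))

def hybrid_sentence_embedding(tokens, embeddings, word_freq, vocab_matrix):
    """Weighted average of word vectors using a hybrid weight: here simply the
    product of the frequency-based and geometry-based weights (one possible way
    to combine the two families of schemes)."""
    vecs, weights = [], []
    for w in tokens:
        if w in embeddings:
            vecs.append(embeddings[w])
            weights.append(sif_weight(w, word_freq) * geometry_weight(w, embeddings, vocab_matrix))
    if not vecs:
        return None
    return np.average(np.array(vecs), axis=0, weights=np.array(weights))
```

Because the two weights come from independent signals (corpus counts versus embedding-space geometry), combining them lets a word that is rare but geometrically uninformative, or common but contextually diverse, still receive a moderate weight.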
Full thesis: pdf

Centre for Data Engineering