IIIT Hyderabad Publications
Improving Word Embeddings and Using Them

Author: Prakhar Pandey
Date: 2019-07-13
Report no: IIIT/TH/2019/81
Advisor: Vikram Pudi

Abstract

Representing words as dense vectors has gained traction since the introduction of the word2vec model. Traditional word representations are sparse and long, with each dimension representing a word in the vocabulary. Dense vectors, in contrast, are dense, meaning most values are non-zero, and short, meaning their dimensionality is much smaller than the size of the vocabulary. Because their dimensions no longer correspond to individual vocabulary words, they also implicitly encode semantics. These advantages have made them the representation of choice for various machine learning models. In this work we introduce a method to improve word embeddings and show the utility of word embeddings in the tasks of controversy detection and item recommendation.

In the first part, we develop a method to improve the word embeddings of one language using the resources of another. Word embeddings learned from a text corpus can be improved by injecting knowledge from external resources, while at the same time specializing them for similarity or relatedness. These knowledge resources (such as WordNet or the Paraphrase Database) may not exist for all languages. We introduce a method to inject the knowledge resources of one language into the word embeddings of another by leveraging bilingual embeddings. First, we improve the word embeddings of German, Italian, French and Spanish using English resources and test them on a variety of word similarity tasks. Then we demonstrate the utility of our method by creating improved embeddings for Urdu and Telugu using Hindi WordNet, beating the previously established baseline for Urdu.

In the second part, we demonstrate a method to detect controversy around news issues. Detecting controversial news topics on the web is a relevant problem today. It helps identify the issues on which people hold divided opinions and is especially useful for topics such as presidential elections, government reforms and climate change. First, we use word embeddings to find similar news articles about a given topic. Then we analyze people's reactions on social media to these news articles, using sentiment analysis and word matching. We apply our method to detect controversial topics during the 2016 US presidential election. The crucial step in controversy detection is to eliminate the bias introduced by people from one part of the political spectrum, so we carry out our analysis on a group of articles reporting an incident rather than on a single article. We use word embeddings to find articles reporting on a topic from different kinds of sources, such as left, right or mainstream. The quality of the embeddings used has a direct effect on the performance of the system.

In the third part, we show another application of the word embedding literature, borrowing the concepts of word embeddings and retrofitting to recommend items in an aggregated marketplace.

Full thesis: pdf

Centre for Data Engineering
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.