IIIT Hyderabad Publications
Generative models for learning document representations along with their uncertainties

Author: Santosh Kesiraju
Date: 2021-01-22
Report no: IIIT/TH/2021/11
Advisors: Suryakanth V Gangashetty, Lukáš Burget

Abstract: The majority of speech and natural language processing applications rely on word and document representations (or embeddings). Document embeddings encode semantic information, which makes them suitable for tasks such as topic identification (document classification), topic discovery (document clustering), language model adaptation, and query-based document retrieval. These embeddings are usually learned from widely available unlabelled data; hence generative (probabilistic) topic models, which aim to capture the distribution of the data, are well suited. Although several probabilistic and neural-network-based topic models exist for learning such embeddings, they often fail to capture the uncertainty in the estimated embeddings, so any error in the estimation affects performance in downstream tasks. The uncertainty in the embeddings typically arises from short, ambiguous, or noisy sentences and documents. This thesis presents models for representing document embeddings as Gaussian distributions, thereby encoding the uncertainty in their covariances. These learned uncertainties are then exploited by the proposed generative Gaussian linear classifier for topic identification. The thesis proposes the subspace multinomial model (SMM), a simple log-linear model, for learning document embeddings. Experiments on the 20Newsgroups text corpus show that embeddings extracted from SMM are superior to those from popular topic models such as latent Dirichlet allocation and sparse topical coding in topic identification and document clustering tasks. Using the variational Bayes framework on SMM, the model can infer the uncertainty in document embeddings, represented by (posterior) Gaussian distributions.
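The subspace multinomial model mentioned in the abstract is a log-linear model in which a low-dimensional document embedding parameterizes the word distribution through a shared subspace. The following is a minimal illustrative sketch, assuming a softmax parameterization of the word probabilities and toy random parameters; the variable names, sizes, and the simple gradient-ascent inference are hypothetical and not taken from the thesis.

```python
import numpy as np

# Illustrative SMM sketch: vocabulary size V, embedding dimension K.
# m is a universal background (log-unigram) vector, T the subspace matrix;
# both are random here purely for demonstration.
rng = np.random.default_rng(0)
V, K = 1000, 50
m = rng.normal(size=V)
T = rng.normal(scale=0.1, size=(V, K))

def log_likelihood(w, x):
    """Multinomial log-likelihood of bag-of-words counts x given embedding w."""
    logits = m + T @ w
    log_theta = logits - np.logaddexp.reduce(logits)  # log-softmax
    return x @ log_theta

def infer_embedding(x, steps=500, lr=0.005, l2=1e-3):
    """Infer a document embedding w by L2-regularized gradient ascent
    on the multinomial log-likelihood (a simplified stand-in for the
    thesis's actual optimization)."""
    w = np.zeros(K)
    n = x.sum()
    for _ in range(steps):
        logits = m + T @ w
        theta = np.exp(logits - np.logaddexp.reduce(logits))
        grad = T.T @ (x - n * theta) - l2 * w  # gradient of log-lik minus L2
        w += lr * grad
    return w

x = rng.integers(0, 5, size=V).astype(float)  # toy word-count vector
w = infer_embedding(x)
```

The inferred `w` serves as the document embedding; in the thesis these embeddings feed topic identification and clustering, while the Bayesian extension replaces the point estimate with a Gaussian posterior.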
Additionally, the intractability that arises when performing variational inference in mixed-logit models is addressed using Monte Carlo sampling via the re-parametrization trick. The resulting Bayesian SMM achieves state-of-the-art perplexity results on the 20Newsgroups text and Fisher speech corpora. The proposed generative classifier exploits the learned uncertainty in the document embeddings and achieves state-of-the-art classification results on the aforementioned corpora when compared to other unsupervised topic and document models. Furthermore, the thesis presents a multilingual extension of the Bayesian SMM for zero-shot cross-lingual topic identification. The proposed model achieves superior classification results compared to systems based on multilingual word embeddings and on neural-machine-translation-inspired sequence-to-sequence bidirectional long short-term memory models.

Full thesis: pdf

Centre for Language Technologies Research Centre
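The re-parametrization trick referred to in the abstract can be sketched as follows: a sample from a Gaussian posterior q(w) = N(mu, diag(sigma^2)) is expressed as a deterministic function of the variational parameters and an auxiliary standard-normal draw, so Monte Carlo estimates of intractable expectations remain differentiable with respect to mu and sigma. All names and dimensions below are illustrative, not from the thesis.

```python
import numpy as np

# Illustrative variational parameters of a diagonal Gaussian posterior
# over a K-dimensional document embedding; values are toy placeholders.
rng = np.random.default_rng(0)
K = 50
mu = rng.normal(size=K)                    # variational mean
log_sigma = rng.normal(scale=0.1, size=K)  # unconstrained log std dev

def sample_w(n_samples):
    """Re-parametrized sampling: w = mu + sigma * eps with eps ~ N(0, I).

    Randomness enters only through eps, so gradients with respect to
    mu and log_sigma can flow through the samples, enabling low-variance
    Monte Carlo estimates of expectations in the variational objective."""
    eps = rng.normal(size=(n_samples, K))
    return mu + np.exp(log_sigma) * eps

samples = sample_w(10_000)
```

With enough samples, the empirical mean and standard deviation of `samples` recover `mu` and `exp(log_sigma)`, confirming that the transformation preserves the target Gaussian.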