IIIT Hyderabad Publications
Towards Better Topic Models for Contemporary Textual Documents of Social Media

Author: Prateek Mehta
Date: 2017-10-04
Report no: IIIT/TH/2017/74
Advisor: Vasudeva Varma

Abstract

The vast scale of information available online is a rich source of collective human knowledge. The transition from conventional information sources such as books and archives to online sources such as web pages has, since the advent of the internet age, created a need to categorize, organize and mine this digital corpus. Immense effort has gone in this direction over the last decade or so, leading to substantial improvements in mining conventional documents. One of the major advances in understanding textual documents has come from topic models such as LDA (Latent Dirichlet Allocation), which learn and highlight the abstract topics that predominate in a collection of documents. More recently, however, social media platforms have equipped their users to communicate, network and share knowledge, and in doing so have changed the way information is generated, propagated and consumed. This has affected not only the means of information sharing but also the characteristics of the documents themselves: their language, style, size and vocabulary. The small size and informal style of a "typical" social media post prevent conventional data mining tools from being used to their full potential, because topic models rely on document-level word co-occurrences to learn the important topics in a corpus and therefore need larger documents. In this thesis, we focus on improving the performance of topic models on contemporary documents and text corpora generated as social media posts. We discuss and extend previous efforts in this direction, and introduce a new algorithm, sentence2cluster, which also helps in document categorization and organization.
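The dependence on document-level word co-occurrence can be made concrete with a toy sketch (the data and function below are purely illustrative, not from the thesis): counting the distinct word pairs observed within each document shows how few co-occurrence observations a short post contributes compared to the same text pooled into one larger document.

```python
from itertools import combinations
from collections import Counter

def cooccurrence_counts(documents):
    """Count document-level word co-occurrences: each unordered pair of
    distinct words appearing in the same document is one observation."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

# Hypothetical corpus: three short posts vs. the same text as one document.
short_posts = ["great camera phone", "phone battery drains", "camera low light"]
pooled = [" ".join(short_posts)]

print(len(cooccurrence_counts(short_posts)))  # 9 distinct pairs across the short posts
print(len(cooccurrence_counts(pooled)))       # 21 pairs once the text is pooled
```

With only nine co-occurring pairs spread over three posts, a model like LDA has little evidence to group related words; pooling the same words into one document more than doubles the observed pairs.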
Our first proposal for improving topic models extends previous work on preprocessing documents through pooling, or aggregation. A popular scheme is user-based pooling, which assumes that a person is interested in only a limited number of topics, so it is useful to combine all of their posts into one large document. We extend this approach and propose aggregating documents not just per user but across users who share similar interests. We use several indicators, such as shared hashtags, communication between users and influence, to identify communities of users whose documents can be pooled to learn a potentially better topic model. Community-based pooling, as opposed to single-user pooling, not only enlarges documents and their context but also helps surface niche topics that are popular only among small groups of users who are not as prolific as others. We run our experiments on Flickr datasets, using network information and several user attributes to find communities of users whose documents can be pooled, and we show that topics learned after community-based aggregation are more coherent than those learned under previously proposed aggregation schemes.

Our second effort towards improving topic models is the use of word embeddings. Topic modeling algorithms such as LDA ignore the information carried by the order of words in a sentence. Our new approach, sentence2cluster, uses this information to learn meaningful word representations and then clusters them. The resulting word clusters represent broader corpus-level topics, which are assumed to appear in individual documents in different proportions.
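The aggregation step itself can be sketched minimally (the data, names and community labels below are hypothetical; in the thesis the communities are derived from Flickr network structure and user attributes, which is not shown here): given a mapping from users to communities, posts are concatenated into one pseudo-document per community rather than per user.

```python
from collections import defaultdict

def pool_by_community(posts, user_to_community):
    """Aggregate posts into one pseudo-document per community.

    posts: list of (user_id, text) tuples.
    user_to_community: dict mapping each user to a community label,
    e.g. derived from shared hashtags or interaction graphs.
    """
    pooled = defaultdict(list)
    for user, text in posts:
        pooled[user_to_community[user]].append(text)
    return {community: " ".join(texts) for community, texts in pooled.items()}

# Hypothetical data: users u1 and u2 belong to the same photography community.
posts = [("u1", "macro lens test"),
         ("u2", "golden hour shots"),
         ("u3", "city traffic update")]
communities = {"u1": "photo", "u2": "photo", "u3": "news"}

print(pool_by_community(posts, communities))
# {'photo': 'macro lens test golden hour shots', 'news': 'city traffic update'}
```

The pooled pseudo-documents would then be fed to a standard topic model such as LDA in place of the original short posts.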
The idea of learning corpus-level topics first and then inferring their presence in each document is inspired by the Biterm Topic Model (BTM), which also targets topic modeling for short text but does not exploit the information hidden in word order within a sentence. We evaluate the utility of the topics learned by our approach by building a navigable corpus browser over real-world documents posted on various social media platforms and forums. We then conduct a user survey to gather feedback on the interpretability and coherence of the topics and on the practical utility of our approach. The results, together with the positive feedback supporting our hypothesis, indicate the potential of word-level semantics for topic modeling of short textual documents.

Our work presents an elaborate discussion of various aspects of topic modeling and how they can be improved. We focus both on preprocessing steps that use better aggregation schemes and on leveraging the information hidden in sentence structure and word-level semantics. We hope this draws attention to better topic models for contemporary texts and drives fruitful research in the various tasks of information categorization and organization.

Full thesis: pdf

Centre for Language Technologies Research Centre
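The inference step described above, determining the proportions in which corpus-level topics appear in a document, can be sketched as follows. This is a simplified illustration, not the thesis implementation: the word-to-cluster map is assumed as given, whereas sentence2cluster would obtain it by clustering word embeddings learned from sentence-level word order.

```python
from collections import Counter

def topic_proportions(document, word_cluster):
    """Estimate a document's topic proportions by counting how many of its
    words fall into each corpus-level word cluster (topic)."""
    counts = Counter(word_cluster[w]
                     for w in document.lower().split()
                     if w in word_cluster)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()} if total else {}

# Hypothetical clusters, e.g. from k-means over learned word vectors.
clusters = {"camera": "photo", "lens": "photo",
            "battery": "hardware", "charger": "hardware"}

print(topic_proportions("new camera lens and battery review", clusters))
# {'photo': 0.666..., 'hardware': 0.333...}
```

Words outside every cluster (here "new", "and", "review") are simply ignored, mirroring how out-of-vocabulary tokens contribute nothing to a topic assignment.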
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.