IIIT Hyderabad Publications
Improving modality interactions in Multimodal Systems

Author: Tanmay Sachan
Date: 2023-06-16
Report no: IIIT/TH/2023/76
Advisor: Vasudeva Varma
Full thesis: pdf
Centre for Language Technologies Research Centre

Abstract

Data on the internet is growing at ever-increasing rates. People share content with each other (or with the world at once) over social media platforms such as Twitter, and consume content from online media outlets such as CNN and Dailymail. Gone are the days of monolithic, text-only blogs. Online content today generally combines multiple modes, or modalities, of communication to convey information: text along with images, audio and video. Machine learning models have long been able to capture and understand images and text as separate entities; the first neural network to be used on images predates the advent of the internet itself. Only recently, however, have machine learning researchers adopted the view that multiple modalities are better understood in a shared setting and under a common architecture, rather than as isolated black boxes. In this thesis, we use data-driven approaches to understand and improve the interaction of modalities within neural network architectures.

The first problem we tackle is fake news detection in tweets and posts on microblogging websites such as Weibo. Existing work on this problem encodes the different modalities independently, with little emphasis on shared learning. Our model generates richer representations by combining embeddings from pre-trained models. Our results beat the state-of-the-art architecture on this task, and the work was accepted as a full paper at the ASONAM 2021 conference.

The second problem we tackle is image-aided summarization. While text summarization is a long-standing problem, text alone is not enough to condense the information in modality-rich sources such as news articles. Our model generates textual summaries that overlap with the image content present in an article, and also selects the most relevant image from the article. We make use of multimodal models such as OSCAR to aid the intermixing of information across modalities.

The third problem is content recommendation. Undertaken as a project at LinkedIn, it aims to improve the ranking of LinkedIn Learning content for each user. Since user history is causal, we treat time as a modality using techniques such as Time2Vec, and train ranking models jointly to better represent user history and predict future actions. Through this methodology, we were able to build a strong recommendation system.

In the fourth problem, we examine the availability of multimodal datasets in Indic languages. To enable and enrich research in this domain in an Indic setting, we create the first authentic (not translated) dataset of image-text pairs in 11 Indian languages. We use deep-learning-based caption filtration techniques to prune the Samanantar dataset, and then use a query simplification algorithm to create queries for downloading images related to those sentences. Our work enables the creation of large multimodal models such as CLIP and OSCAR in an Indian setting.
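As a rough illustration of the embedding-combination idea behind the fake news work, a late-fusion head over frozen pre-trained encoders might look like the sketch below. The dimensions, module names, and two-class output are illustrative assumptions, not the architecture described in the thesis.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Illustrative late fusion: project text and image embeddings from
    pre-trained encoders into a shared space, concatenate, and classify.
    All dimensions and the fusion scheme are assumptions, not the thesis model."""

    def __init__(self, text_dim: int = 768, image_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # fake vs. real logits
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim), image_emb: (batch, image_dim)
        fused = torch.cat([self.text_proj(text_emb), self.image_proj(image_emb)], dim=-1)
        return self.head(fused)  # (batch, 2)
```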
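The temporal encoding used in the recommendation problem can also be sketched concretely. Below is a minimal PyTorch sketch of Time2Vec (Kazemi et al., 2019), which represents a scalar timestamp as one linear term plus k learnable periodic terms; the embedding size and how the resulting vectors feed into the ranking model are assumptions, not details from the thesis.

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Learnable time encoding: one non-periodic (linear) component plus
    k periodic (sine) components per scalar timestamp."""

    def __init__(self, k: int = 16):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(1))
        self.b0 = nn.Parameter(torch.randn(1))
        self.w = nn.Parameter(torch.randn(k))
        self.b = nn.Parameter(torch.randn(k))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, seq_len) timestamps -> (batch, seq_len, k + 1)
        t = t.unsqueeze(-1)
        linear = self.w0 * t + self.b0               # captures trend over time
        periodic = torch.sin(self.w * t + self.b)    # captures periodic behaviour
        return torch.cat([linear, periodic], dim=-1)
```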