Towards Understanding Bollywood Lyrics

Author: Drushti Apoorva G
Date: 2018-11-23
Report no: IIIT/TH/2018/82
Advisor: Radhika Mamidi

Abstract

Research in Natural Language Processing is expanding into multiple domains, and with every advancement the variety of text that can be processed grows. One such domain is lyrics processing. Songs are vital to the music and film industries, and their lyrics can be analysed to obtain information such as the genre, theme, and mood of a song, supplementing what is learned from its audio features. Bollywood, the Indian film industry, earns substantial revenue from songs. The number of songs it produces is massive, making it a rich source of audio and textual data for Natural Language Processing tasks. It also gives us an opportunity to work on data in Hindi, which is relatively less explored. The focus of this thesis is on the textual part of the data. In an attempt to create a data resource for this domain, this work presents a corpus of Bollywood song lyrics and their metadata, annotated with sentiment polarity. We call this corpus BolLy. It contains the lyrics of 1055 songs, ranging from those composed in 1970 to the most recent ones. All annotation was done manually by three annotators, which makes the dataset a valuable resource for training. In this work, we describe the creation and annotation process, the content, and the possible uses of the dataset. As an experiment, we built a basic classification system that identifies the emotion polarity of a song based solely on its lyrics; it can serve as a baseline for this task. The lyrics of Hindi songs were classified as having positive or negative sentiment by extracting opinions. Initial experiments employing subjectivity lexicons and probabilistic approaches were conducted on a dataset that was only a subset of ‘BolLy’. This motivated us to work with a bigger dataset and focus our efforts on expanding it.
With the complete ‘BolLy’ dataset, the experiments were run using different variations of the Naive Bayes classifier. The presented dataset supports a multitude of Natural Language Processing applications, and this thesis explores one of them in detail: keyword extraction. The keywords of a document represent its content, and meaningful keywords facilitate the search and organization of documents. Methods that automatically identify keywords in a document are therefore important, as manual processes are cumbersome and error-prone. For song lyrics, this task has varied applications, such as recommendation systems and digital music library management. This work proposes and compares methods to identify keywords from the lyrics of Bollywood songs. We use a collection of lyrics of 1055 Bollywood songs, all written in the Devanagari script. Experiments include looking at the spatial distribution of terms, their occurrence in a certain context or position, and using WordNet to generate keywords not present in the document. Human annotators validated the methods by scoring each one based on its results on a subset of the data. We also used Latent Dirichlet Allocation and Latent Semantic Indexing to validate the results, as further explained in the thesis.
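As an illustration of the kind of baseline mentioned above, the following is a minimal sketch of a multinomial Naive Bayes polarity classifier with add-one smoothing. It is not the implementation used in the thesis: the function names `train_nb` and `predict_nb`, the smoothing parameter `alpha`, and the romanised toy tokens are all assumptions made for this example, and the toy documents are invented rather than taken from BolLy.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenised documents.

    docs   : list of token lists (hypothetical, pre-tokenised lyrics)
    labels : list of class labels, one per document
    alpha  : additive (Laplace) smoothing constant, an assumed default
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # per-class token frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(labels)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "word_counts": word_counts,
        "log_prior": {c: math.log(n / n_docs) for c, n in class_counts.items()},
        "totals": {c: sum(word_counts[c].values()) for c in class_counts},
    }


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    alpha, vocab = model["alpha"], model["vocab"]
    best_label, best_score = None, -math.inf
    for label, log_prior in model["log_prior"].items():
        denom = model["totals"][label] + alpha * len(vocab)
        score = log_prior
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are skipped
                count = model["word_counts"][label][tok]
                score += math.log((count + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data for illustration only (not from the BolLy corpus).
toy_docs = [
    ["khushi", "pyaar", "khushi"],
    ["pyaar", "geet"],
    ["dard", "gham"],
    ["gham", "aansu", "dard"],
]
toy_labels = ["positive", "positive", "negative", "negative"]

model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, ["khushi", "geet"]))
print(predict_nb(model, ["dard", "aansu"]))
```

Scores are summed in log space to avoid underflow on long lyrics, and smoothing keeps a word seen in only one class from driving the other class's probability to zero.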

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
