IIIT Hyderabad Publications
Audio and Text based Multimodal Sentiment Analysis using Features Extracted from Selective Regions and Deep Neural Networks

Author: Harika Abburi
Date: 2017-06-29
Report no: IIIT/TH/2017/40
Advisors: Suryakanth V Gangashetty, Manish Shrivastava

Abstract

Sentiment analysis has emerged as a field that has attracted a significant amount of attention over the last decade. It is the study of analyzing people's reviews, songs and attitudes from different types of data and classifying them as positive, negative or neutral. The recent advancement of social media, an enormous and ever-growing source, has led people to share their views through various modalities such as audio, text and video. This source of information is important for automatically inferring the sentiment embedded in different types of data such as reviews and songs. In this thesis, an improved multimodal approach is proposed to detect the sentiment of product reviews and songs based on their multimodal nature (audio and text). The basic goal is to classify the input data as carrying either positive or negative sentiment. The databases used in this study are Spanish product reviews, Hindi product reviews and Telugu songs.

Most existing systems for audio- or speech-based sentiment analysis use conventional audio features extracted from the entire signal, but these are not domain-specific features for extracting sentiment. In this work, instead of extracting features from the entire signal, specific regions of the audio signal are identified and experiments are performed on these regions by extracting the relevant features. For the songs data, experiments are performed over each entire song as well as over its beginning and ending regions. For all these cases, Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers are built using prosody, temporal, spectral, tempo and chroma features. Experimental results show that the rate of detecting the sentiment of a song is higher at the beginning of the song than at its ending region or over its entire duration. This is because the instruments and vocals that convey the sentiment in the beginning part of a song may or may not be sustained throughout the song.

For the reviews data, these experiments could not be performed because the sentiment may not be present in the beginning or ending regions of an utterance. So, for the reviews data, stressed and normal regions are identified using the strength of excitation. From the stressed regions, Mel Frequency Cepstral Coefficient (MFCC) features are extracted and a GMM classifier is built. Further, experiments are performed by extracting prosody (energy, pitch and duration) and relative prosody features from both regions and from the entire audio signal, and a GMM classifier is built. The results show that the performance on the specific regions is better than on the entire signal. It is also observed that relative prosody features extracted from both regions detect the sentiment with higher accuracy than the prosody and MFCC features, because the natural variations present in the prosody features are reduced by the relative prosody features.
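As an illustration of the GMM-based audio classification described above, the following is a minimal sketch assuming librosa for MFCC extraction and scikit-learn for the mixture models. Stressed-region detection via the strength of excitation is outside the scope of this sketch, so frames are drawn from the whole signal, and the function names and paths are hypothetical, not from the thesis.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(path, n_mfcc=13):
    # Frame-level MFCCs; in the thesis these would be taken only from
    # the stressed regions identified via the strength of excitation.
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def train_gmms(paths_by_class, n_components=32):
    # One GMM per sentiment class, fit on the pooled frames of that class.
    gmms = {}
    for label, paths in paths_by_class.items():
        frames = np.vstack([extract_mfcc(p) for p in paths])
        gmms[label] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(frames)
    return gmms

def classify(path, gmms):
    # Hypothesize the class whose GMM assigns the highest average
    # frame log-likelihood to the test utterance.
    frames = extract_mfcc(path)
    return max(gmms, key=lambda label: gmms[label].score(frames))
```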
Recently, neural networks have achieved good success on sentiment classification. In this work, deep learning architectures such as the Deep Neural Network (DNN) and the Deep Neural Network with Attention Mechanism (DNNAM) are also explored. Here the stressed-regions approach fails because of the limited training data: DNN performance depends on the amount of training data, and the more training data there is, the more accurate the model. The experiments are therefore performed on combinations of frames, which yields better performance because not every individual frame carries the sentiment. The MFCC features considered are 13-dimensional, 65-dimensional and 130-dimensional feature vectors. From these studies, it is observed that the DNNAM classifier gives better results than the DNN, because the DNN approach is frame-based whereas the DNNAM performs utterance-level classification, thereby efficiently making use of the context.

For text-based sentiment analysis, transcriptions are produced manually from the audio signal. For song classification, SVM and Naive Bayes classifiers are built using textual features computed as Doc2Vec vectors. As in the audio case, experiments are performed on the beginning, the ending and the entire song. The studies show that the beginning of a song detects the sentiment with higher accuracy than the ending region or the entire song. As similar experiments could not be carried out with the reviews data, the entire document is taken as input to extract the sentiment. Support Vector Machine (SVM) and Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) classifiers are used to develop a sentiment model with textual features computed by Doc2Vec and Word2Vec. From the experimental studies, it is observed that the LSTM-RNN outperforms the SVM because it is able to memorize long temporal context.

Finally, both modalities, audio and text, are combined to extract the sentiment. The two modalities are fused by hypothesizing the class with the highest average probability across the classifiers, as sketched below. It is observed from the studies that the simultaneous use of these two modalities helps to build a better sentiment analysis model for detecting whether the given input carries positive or negative sentiment.
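The score-level fusion can be stated compactly. Below is a minimal sketch, assuming each modality's classifier outputs class posteriors in the order [negative, positive]; the function name and example probabilities are illustrative, not taken from the thesis.

```python
import numpy as np

def fuse_average_probability(p_audio, p_text, classes=("negative", "positive")):
    # Average the posteriors from the audio and text classifiers and
    # hypothesize the class with the highest mean probability.
    p = (np.asarray(p_audio, dtype=float) + np.asarray(p_text, dtype=float)) / 2.0
    return classes[int(np.argmax(p))], p

# Illustrative posteriors: audio model [0.35, 0.65], text model [0.20, 0.80].
label, p = fuse_average_probability([0.35, 0.65], [0.20, 0.80])
print(label, p)  # -> positive [0.275 0.725]
```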
Keywords: Sentiment analysis, multimodal classification, text features, audio features, lyric features, stressed regions, normal regions, relative prosody features, Mel frequency cepstral coefficients (MFCC), Doc2Vec, Word2Vec, Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Naive Bayes (NB), Deep Neural Network (DNN), Deep Neural Network with Attention Mechanism (DNNAM), Long Short-Term Memory Recurrent Neural Network (LSTM-RNN).

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.