Analysis of Emotional Speech using Excitation Source Information

Author: Gangamohan P
Date: 2019-07-11
Report no: IIIT/TH/2019/87
Advisor: B Yegnanarayana, Suryakanth V Gangashetty

Abstract

Speech communication carries a message at two levels: the explicit message and the implicit message. The explicit message corresponds to the linguistic information. The implicit message consists of information regarding the speaker’s signature, the underlying emotion of the speaker, gender, etc. The production of basic linguistic sound units is generally understood using factors such as voicing and the place and manner of articulation. Implicit information such as gender and the speaker’s signature is due to physiological constraints of the speaker. In different emotional states, the linguistic sound units are produced by changes in the production mechanism in such a way that they do not affect the linguistic information. The state-of-the-art approaches for automatic emotion recognition use common features for representation. The feature representations can be categorized into spectral, voice quality, and prosody features. Along with the emotion-related characteristics, the voice quality features carry information about the speaker’s signature, and the spectral features carry sound unit information. Also, because similar properties are shared across emotions, the performance of emotion classification systems developed using these features is generally low. The main challenge in emotional speech analysis is to identify and extract emotion-specific features, i.e., features independent of speaker and sound units. The objective of this thesis is to study the features of speech production that contribute to emotion characteristics in speech, and to develop methods to extract the emotion-specific features. The speech production mechanism is a combination of several components such as the vocal tract system, excitation source, and prosody-related information (intonation and duration). In this thesis, the relative contribution of different components of speech production to the perception of emotion is studied.
It is observed that the components related to the excitation source and prosody carry emotion characteristics predominantly. The significant excitation of the vocal tract system occurs around the glottal closure instants (GCIs). The region around the GCI corresponds to a high signal-to-noise ratio (SNR) region. Excitation source-related parameters such as the abruptness of glottal closure, strength of excitation, and energy of excitation extracted around the GCIs are examined for emotion classification. These excitation source-related parameters are also observed to be speaker-dependent, and hence they are expressed relative to neutral speech for emotion classification. For extraction of features that are independent of speaker and sound units, a hierarchical approach is considered for detailed analysis. An important characteristic of emotional speech is that it carries inherent voice qualities. Voice qualities of speech such as arousal and rhythm are different for different emotions. In view of the inherent voice qualities of emotion, the following studies are identified: discrimination between modal speech and falsetto speech, identification of high arousal speech segments in modal speech, and discrimination between angry speech and happy speech in the high arousal case. All of these studies have been carried out using excitation source information at various levels of speech. The excitation source information related to the abruptness of glottal closure at the subsegmental level discriminates modal and falsetto speech. However, this subsegmental information is inadequate for discrimination of high arousal and neutral speech. It is observed that the excitation source information of the entire glottal cycle is useful for identification of high arousal speech segments. For discrimination between angry speech and happy speech, the excitation source information at the suprasegmental level appears to be useful.
The studies carried out in this thesis highlight the importance of analysis of the glottal vibration characteristics, and the need for different sets of features for different emotions. The key contributions of this thesis are:
• Discrimination between falsetto and modal utterances using the excitation source information at the instants of significant excitation.
• Identification of high arousal regions in the speech signal using the excitation source information of the entire glottal cycle.
• Reference-based emotion recognition system using the excitation source parameters extracted at the subsegmental level of speech.
• Collection of a resourceful emotional speech database.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
