IIIT Hyderabad Publications
Detection of Vowel Landmarks and its Application to Analysis of Speech-Laugh
Author: Sri Harsha Dumpala
Report no: IIIT/TH/2016/25
Humans perceive speech as a sequence of discrete sounds. However, representing speech as a sequence of phones with well-defined, non-overlapping intervals is a challenging problem. The vocal tract system (i.e., vocal tract configuration and vocal fold vibration) varies continuously, which leads to the co-articulation effect, where the current phone is considerably influenced by its adjacent phones. Despite this co-articulation effect, there exist events called landmarks, which occur due to abrupt variations in the vocal tract system. Detecting these landmarks makes it possible to represent speech as a sequence of phones, which may help in building systems that process speech in a manner similar to humans.
Landmarks are the time instants in the acoustic signal that are consistently correlated with major articulatory movements, such as the transition of the vocal tract from a more open to a more closed configuration and vice versa (i.e., a change in manner of articulation), and the transition from free vibration to complete cessation of vocal fold vibration and vice versa. Because these landmarks are foci, speech can be processed only around the landmarks instead of over the entire speech signal, thus reducing the amount of processing required. Also, the analysis around different landmarks can be done at different resolutions. Landmark detection is hierarchical, and hence more than one piece of evidence can be obtained for making decisions.
In this thesis, the hierarchy that exists among landmarks is exploited for vowel landmark detection (VLD). Sonorant segmentation is performed first, and then vowel landmarks are detected in the sonorant regions. A sonorant is a sound produced without a constriction strong enough to cause turbulent noise or stoppage of airflow.
Broad manner classes such as vowels, nasals and approximants are categorized as sonorants, whereas fricatives, stops and non-speech regions are considered non-sonorants. Sonorant segmentation of speech signals is critical in developing automatic speech recognition (ASR) systems and audio search systems, and for automatic segmentation of speech corpora. In this work, acoustic features based on the excitation source and vocal tract system characteristics of sonorant sounds are proposed for segmentation of sonorant regions in continuous speech. The features are based on the energy of the zero frequency resonator signal, the strength of excitation and the dominant resonance frequency around epochs. An algorithm is developed to relate these features in a hierarchical manner using a knowledge-based approach. The performance of the proposed algorithm is studied on three different datasets, for varying levels of degradation.
Vowels, being a subclass of sonorant sounds, exhibit characteristics similar to those of other sonorants. The features used for sonorant segmentation are therefore also considered for detecting vowel landmarks, along with features that capture characteristics specific to vowels. Using these features, a rule-based algorithm is developed for VLD. The performance of the proposed VLD algorithm is studied on three different databases, namely TIMIT (read speech), NTIMIT (channel-degraded speech) and the Switchboard corpus (conversational speech). The proposed algorithm is also tested on the TIMIT and NTIMIT datasets for different levels of noise degradation.
Speech-laugh is a speech-synchronous form of laughter that often occurs in natural conversation. Speech-laugh not only signifies the emotional state of a speaker, but also carries linguistic information. Traditional automatic speech recognition (ASR) systems consider both laughter and speech-laugh as paralinguistic elements, which results in a loss of information.
Discriminating speech-laugh from laughter improves the accuracy of ASR systems. It also helps in identifying the emotion expressed by the laughter, e.g., happiness or sarcasm. In this work, as an application of VLD, excitation source features extracted only in the vowel regions are analyzed for discriminating speech-laugh from laughter and neutral speech.
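The two-stage hierarchy described in the abstract (sonorant segmentation first, then vowel landmark detection restricted to the sonorant regions) can be illustrated with a minimal sketch. This is not the thesis algorithm: the per-frame evidence contours, the thresholds and the function names `sonorant_regions` and `vowel_landmarks` are hypothetical, and the actual features (zero frequency resonator energy, strength of excitation, dominant resonance frequency) and rules are not reproduced here.

```python
from typing import List, Sequence, Tuple


def sonorant_regions(evidence: Sequence[float],
                     threshold: float) -> List[Tuple[int, int]]:
    """Stage 1 (hypothetical rule): return [start, end) frame ranges
    where the sonorant evidence is at or above the threshold."""
    regions = []
    start = None
    for i, value in enumerate(evidence):
        if value >= threshold and start is None:
            start = i                      # region opens
        elif value < threshold and start is not None:
            regions.append((start, i))     # region closes
            start = None
    if start is not None:                  # region runs to the last frame
        regions.append((start, len(evidence)))
    return regions


def vowel_landmarks(sonorant_evidence: Sequence[float],
                    vowel_evidence: Sequence[float],
                    sonorant_threshold: float,
                    vowel_threshold: float) -> List[int]:
    """Stage 2 (hypothetical rule): inside each sonorant region, mark
    frames where the vowel evidence is a strict local maximum and is
    at or above the vowel threshold."""
    landmarks = []
    for start, end in sonorant_regions(sonorant_evidence, sonorant_threshold):
        for i in range(start, end):
            value = vowel_evidence[i]
            # Neighbours outside the region are treated as -infinity.
            left = vowel_evidence[i - 1] if i > start else float("-inf")
            right = vowel_evidence[i + 1] if i + 1 < end else float("-inf")
            if value >= vowel_threshold and value > left and value > right:
                landmarks.append(i)
    return landmarks


# Made-up per-frame contours, for illustration only.
sonorant = [0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9, 0.8, 0.1]
vowel = [0.9, 0.3, 0.8, 0.4, 0.9, 0.2, 0.2, 0.3, 0.7, 0.9]
print(vowel_landmarks(sonorant, vowel, 0.5, 0.5))  # prints [2, 8]
```

In this toy input, the vowel-evidence peaks at frames 0, 4 and 9 are ignored because they fall outside the sonorant regions, which is the point of detecting landmarks hierarchically.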
Centre: Language Technologies Research Centre