Vowel Region based Speech Analysis and Applications

Author: Thirumuru Ramakrishna
Date: 2019-12-19
Report no: IIIT/TH/2019/127
Advisor: Anil Kumar Vuppala

Abstract

A speech signal can be considered a sequence of acoustic events, where each acoustic event can be distinctly characterized by its production mechanism. Acoustic event-based signal processing techniques have gained importance for producing speech feature representations that are robust to noise. The detection of specific acoustic events that correspond to the unique phonetic features of speech is important in speech assessment systems and clinical speech analysis. In landmark or acoustic event-based speech analysis, the speech signal is processed around specific locations known as events instead of over the entire signal. Most Indian languages are vowel-centric, and hence consonants must be grouped with the vowel that follows or precedes them. This process of clustering a consonant or consonant cluster with either the preceding or the following vowel is known as syllabification. Accurate vowel region detection from continuous speech has many applications, including robust speaker recognition, smart audio filtering, consonant-vowel unit recognition in Indian languages, speech rate manipulation, and multimedia synchronization. With this motivation, the thesis explores the use of speech production knowledge in the detection of vowel regions. Further, the vowel region detection techniques are used to detect retroflex approximants and palatalized consonants in the Indian context. The important issues addressed in this thesis are listed below:
• A two-stage algorithm is proposed to detect precise vowel regions using dominant spectral peaks of the zero frequency filtered speech signal.
In post-processing, spurious vowel regions are removed based on uniformity of epoch intervals and the positions of vowel onset points, and vowel end-points are corrected using the strength of the excitation of the speech signal.
• A technique is proposed for vowel region detection from continuous speech using the envelope of the derivative of the speech signal, which is a non-negative, frequency-weighted energy operator.
• A language identification system using deep neural networks with attention is explored, with a vowel region detection scheme incorporated at the front end. The spectral features extracted from the vowel regions, instead of the entire speech, are used to model the language identification system.
• An approach for automatic detection of the retroflex approximant in continuous Tamil speech is proposed using vowel regions and formant dynamics. The formant dynamics are obtained through the Hilbert envelope of the numerator of the group delay function.
• A technique is proposed to detect auxiliary palatalization in continuous Kashmiri speech. These consonants are investigated in synchrony with vowel regions, which are spotted using the instantaneous energy computed from the envelope-derivative of the speech signal.
• A new speech feature representation is proposed for emotion recognition by integrating the single frequency filtering technique with a higher-order nonlinear energy operator. The performance of the system is evaluated in noisy environments by extracting the speech features in vowel regions instead of the entire speech.
The final remarks drawn from this thesis work are as follows: The performance of the proposed vowel region detection methods is observed to be better than that of the existing methods under clean and noisy environments. The proposed techniques utilize speech production knowledge in terms of the strength of the excitation and the uniformity of the epoch intervals of the speech signal.
Moreover, the vowel region detection technique based on the envelope of the derivative of the speech signal performs better than the method based on the zero frequency filtering technique, mainly because it captures the instantaneous energy of the speech signal more effectively. A front-end vowel region detection scheme is incorporated into the language identification system so that spectral features are extracted from the vowel regions instead of the entire speech signal. It exhibited a significant improvement in the recognition rate in noisy environments. A speaker-independent retroflex approximant detection approach is proposed to detect the language-specific phonetic feature of Tamil in clean and noisy environments. In another study, palatalization in Kashmiri is detected by anchoring to vowel regions. The technique for detecting the palatalization process in the vowel context is based on a set of acoustic cues. These include the second formant gradient, a measure of the second formant nearing the third formant, and a high-to-low energy ratio derived from the spectral characteristics of the palatalized consonants using the Hilbert envelope of the numerator of the group delay function. The results showed that these acoustic cues capture secondary articulatory changes in palatalized consonants and allow them to be detected in continuous speech. Lastly, a new feature representation is proposed for emotion recognition using a higher-order nonlinear energy operator and the single frequency filtering technique. The performance of the proposed features is compared with state-of-the-art features extracted in the vowel regions. Experimental results show that these features outperformed Teager energy-based nonlinear features and modulation spectral features in terms of recognition rate in clean and noisy environments.
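As a rough illustration of the "envelope of the derivative" energy mentioned in the abstract, the sketch below computes the squared envelope of a signal's first derivative, which is non-negative by construction and grows with frequency. It is not the thesis's vowel region detector. The definition used (derivative squared plus the squared Hilbert transform of the derivative), the function names, the central-difference derivative, and the synthetic test tones are all assumptions made for illustration.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal built in the frequency domain (FFT-based Hilbert transformer)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep the DC term once
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist term once
        gain[1:n // 2] = 2.0           # double the positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def envelope_derivative_energy(x):
    """Assumed definition: dx^2 + H[dx]^2, i.e. the squared envelope of the derivative."""
    dx = np.gradient(np.asarray(x, dtype=float))   # central difference, per-sample units
    return np.abs(analytic_signal(dx)) ** 2

# Two synthetic tones of equal amplitude (hypothetical test data, not speech).
fs = 8000
t = np.arange(fs) / fs
low = np.cos(2 * np.pi * 100 * t)
high = np.cos(2 * np.pi * 200 * t)

e_low = envelope_derivative_energy(low)
e_high = envelope_derivative_energy(high)

# Compare mean energies away from the edges, where the one-sided
# differences used by np.gradient add a small boundary error.
ratio = e_high[100:-100].mean() / e_low[100:-100].mean()
print(round(float(ratio), 2))
```

Doubling the frequency at equal amplitude should raise the energy by a factor close to four, which is the frequency weighting the abstract refers to. A detector built on this quantity would still need smoothing and thresholding choices that the abstract does not specify.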
Keywords: Acoustic event, Epoch intervals, Frequency-dependent energy operator, Hilbert envelope of the numerator of the group delay, Palatalized consonant, Retroflex approximant, Single frequency filtering, Strength of the excitation, Vowel region.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
