IIIT Hyderabad Publications
Towards Building a Robust Telugu ASR System for Emotional Speech

Author: Vishnu Vidyadhara Raju V
Date: 2020-10-17
Report no: IIIT/TH/2020/92
Advisor: Anil Kumar Vuppala

Abstract

The performance of an automatic speech recognition (ASR) system degrades when there is a mismatch between the training and operating environments. The presence of expressive (emotional) speech is one such mismatch, as the majority of ASR systems are trained on neutral speech. The emotional state of the speaker induces changes in the speech characteristics and affects the ASR system in practical scenarios. The goal of this thesis is to improve the performance of ASR systems under these emotional conditions. The key challenge in addressing this research problem is the lack of resources: existing emotional databases are limited in both the number of speakers and their size. The main focus of this thesis is to create the infrastructure required to study this challenging problem for the low-resource Telugu language and to present different exploratory studies that evaluate the accuracy of Telugu ASR systems. The thesis investigates several techniques, at various stages of the recognition process, that are suitable for building an emotionally robust ASR system. In the first study, prosody modification is employed at the pre-processing level of the speech recognizer. Model-based and feature-space adaptation approaches are also analyzed for improving ASR systems. These emotion adaptation strategies were studied using various deep neural network (DNN) architectures and shown to be effective in comparison with baseline Gaussian mixture models (GMMs). The experiments are conducted using the IIT Kharagpur simulated emotion speech corpus (IITKGP-SESC) and the IIIT-Hyderabad Telugu naturalistic emotional speech corpus (IIIT-H TNESC).
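To illustrate the pre-processing idea, the sketch below modifies two of the three prosody parameters of a waveform, duration (via linear-interpolation time scaling) and energy (via RMS scaling), toward target values. Pitch modification would require an epoch-based method such as TD-PSOLA and is omitted here. The function names, the target values, and the crude resampling scheme are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def modify_duration(signal: np.ndarray, rate: float) -> np.ndarray:
    """Crude duration modification by linear interpolation.

    rate > 1 shortens the utterance, rate < 1 lengthens it.
    (A real system would use an epoch/PSOLA-based method so that
    pitch is preserved; this is only an illustrative sketch.)
    """
    n_out = int(round(len(signal) / rate))
    old_idx = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(old_idx, np.arange(len(signal)), signal)

def modify_energy(signal: np.ndarray, target_rms: float) -> np.ndarray:
    """Scale the waveform so its RMS energy matches target_rms."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal if rms == 0 else signal * (target_rms / rms)

# Toy usage: "neutralize" a loud, fast synthetic utterance.
sr = 16000
t = np.arange(sr) / sr                          # 1 second of audio
emotional = 0.9 * np.sin(2 * np.pi * 220 * t)   # loud toy signal
neutralized = modify_energy(modify_duration(emotional, rate=0.8),
                            target_rms=0.1)

print(len(neutralized))  # lengthened: 20000 samples
print(round(float(np.sqrt(np.mean(neutralized ** 2))), 3))  # 0.1
```

In a real pipeline the target prosody values would be estimated from neutral reference speech, and the modified waveform would then be passed to the unmodified neutral-speech ASR system.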
The major studies and conclusions are:

• Prosody parameters such as pitch, duration, and energy play a vital role in the analysis of emotional speech. In the first study, prosody modification is used at the pre-processing level to convert emotional speech to neutral speech. This prosody-modified emotional speech improved the performance of the ASR system.

• In the second study, prosody modification is used to generate emotional training data from the available neutral speech. Separate ASR models were built for the generated emotional speech alongside the existing neutral speech. An emotion recognition system built using differenced prosody features routes incoming speech to the corresponding ASR model.

• From the above studies, we observed that using emotional data for training gave better performance than pre-processing the emotional speech. Hence, the third study explores the feasibility of extending emotion adaptation algorithms from GMM-HMM acoustic models to DNN-based models, using model-based and feature-space adaptation approaches. The best performance is observed for TDNN-based acoustic models that use utterance-level decisions as their objective function instead of a standard frame-level decision.

• Feature-space adaptation strategies performed better than model-space adaptation techniques. The auxiliary features appended to the conventional MFCCs carry emotion-specific information, which helps the ASR systems handle emotional speech better. Model-based adaptation could have performed well if sufficient emotional data had been provided for adaptation. fMLLR-based adaptation is more effective than MAP adaptation at handling emotion-specific information when building the ASR systems.

Full thesis: pdf

Centre for Language Technologies Research Centre
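The feature-space adaptation in the last point can be pictured as a frame-wise concatenation: an utterance-level auxiliary vector carrying emotion information (for example, an emotion posterior from a separate classifier) is tiled and appended to every MFCC frame before the features reach the acoustic model. The shapes, names, and the 4-class posterior below are illustrative assumptions, not the thesis's exact configuration.

```python
import numpy as np

def append_auxiliary(mfcc: np.ndarray, aux: np.ndarray) -> np.ndarray:
    """Append an utterance-level auxiliary vector to every MFCC frame.

    mfcc: (n_frames, n_ceps) frame-level features.
    aux:  (n_aux,) utterance-level emotion vector (assumed here to be
          a class posterior from a separate emotion classifier).
    Returns (n_frames, n_ceps + n_aux) augmented features.
    """
    tiled = np.tile(aux, (mfcc.shape[0], 1))  # repeat aux for each frame
    return np.hstack([mfcc, tiled])

# Toy usage: 100 frames of 13-dim MFCCs plus a 4-dim emotion posterior.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))
emotion_posterior = np.array([0.7, 0.1, 0.1, 0.1])  # toy distribution
augmented = append_auxiliary(mfcc, emotion_posterior)
print(augmented.shape)  # (100, 17)
```

The acoustic model then sees the same auxiliary dimensions on every frame of an utterance, which lets it condition its output on the detected emotion without any change to the model-space parameters.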
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.