Dialect Classification and Multi-Dialect Speech Recognition

Author: Rashmi Kethireddy 20172044
Date: 2024-02-23
Report no: IIIT/TH/2024/29
Advisor: Suryakanth V Gangashetty

Abstract

The major goal of this thesis is to study dialectal variations and to improve the performance of speech recognition using embeddings derived from an improved dialect classification system. Initial studies focused on improving dialect classification for three major dialects of English (AU: Australian, UK: British, and US: American). Based on an analysis of dialectal variations, advanced signal processing approaches are proposed and investigated for dialect classification with a traditional i-vector system. Features that provide high spectral resolution help capture subtle differences between dialects. This thesis therefore proposes single frequency filtering (SFF) and zero-time windowing (ZTW) based features, which provide high spectral resolution without compromising temporal resolution. Along with frame-level spectral resolution, longer temporal context also contributes to dialect classification. Approaches that extend the temporal context of the proposed features (SFF and ZTW), namely delta and double delta coefficients (∆+∆∆) and shifted delta coefficients (SDCs), are therefore evaluated. The dialect classification system gives promising performance with the proposed features when temporal context is provided by ∆+∆∆ and SDCs. Further, signal processing approaches that provide long temporal summarization, such as frequency domain linear prediction (FDLP), are proposed for dialect classification. Experiments with FDLP based features show that long temporal summarization is advantageous for discriminating dialects. Thus, both the approaches that provide high spectral resolution (SFF and ZTW) and the approach that provides long temporal summarization (FDLP) give promising performance in dialect classification compared to the commonly used STFT based features.
Further, because deep neural networks perform well in classification tasks and can model longer temporal context, architectures ranging from simple (CNN) to advanced (TCN, TDNN, and ECAPA-TDNN), each providing a different temporal context, are investigated. The advanced architectures improve the performance of dialect classification. When the best of both stages are combined, ECAPA-TDNN performs better with the proposed SFF features. Dialectal variations in speech degrade the performance of multi-dialect automatic speech recognition (ASR) systems. The embeddings derived from the best dialect classification system are applied to a multi-dialect ASR system (with AU, UK, and US dialects) and are found to improve its performance. In most studies, Indian English is treated as a single dialect even though its speakers have different native languages. Including foreign dialectal embeddings therefore improved the performance of the ASR system. The observations made for dialect classification with the major dialects of English are extended to foreign dialect classification, i.e., native language (L1) identification. The embeddings extracted from the improved dialect classification system are included in the Indian English ASR system to improve its performance.
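
As an illustration of the temporal context features named in the abstract, the following is a minimal sketch of delta and shifted delta coefficient (SDC) computation over a sequence of frame-level feature vectors. It is not taken from the thesis: the function names, the simple difference form of the delta, the clamping of edge frames, and the parameter values (d, p, k) are all assumptions made for illustration, and the abstract does not state the configuration actually used.

```python
def delta(frames, d=1):
    """Simple difference delta: c[t + d] - c[t - d].

    Assumption: indices past either end of the utterance are clamped
    to the first or last frame.
    """
    n = len(frames)
    out = []
    for t in range(n):
        ahead = frames[min(t + d, n - 1)]
        behind = frames[max(t - d, 0)]
        out.append([a - b for a, b in zip(ahead, behind)])
    return out


def shifted_delta(frames, d=1, p=3, k=7):
    """Stack k delta vectors taken at shifts of p frames ahead of frame t.

    Each output frame has len(frames[0]) * k values, so it summarizes
    roughly (k - 1) * p + 2 * d + 1 input frames of context.
    Assumption: shifts past the end reuse the last delta vector.
    """
    n = len(frames)
    deltas = delta(frames, d)
    out = []
    for t in range(n):
        stacked = []
        for i in range(k):
            stacked.extend(deltas[min(t + i * p, n - 1)])
        out.append(stacked)
    return out
```

With d=1, p=3, and k=7, each stacked vector spans about 21 frames, which is how SDCs provide a longer temporal context than ∆+∆∆ computed over a few neighbouring frames.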

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
