IIIT Hyderabad Publications
Implicit Indian Language Identification Using Different Deep Neural Network Architectures

Author: Ravi Kumar Vuddagiri
Date: 2022-10-31
Report no: IIIT/TH/2022/123
Advisor: Anil Kumar Vuppala

Abstract

Language identification (LID) refers to the task of recognizing the language (e.g., Hindi, English, Telugu) from a spoken utterance. In a multilingual country like India, automatic LID systems play a key role in developing speech applications. Indian languages belong mainly to two language families, Indo-Aryan and Dravidian. Languages from these families share a common phonetic space, which makes language identification a challenging task in the Indian scenario. Extracting language-discriminative information from speech plays a vital role in developing LID systems. The lack of large-scale datasets covering diversities such as speakers and dialects adds a further level of complexity to the task. Large-scale datasets are a prerequisite for effectively adapting recent advances in deep neural networks (DNNs) to LID, so the scarcity of such data poses an additional challenge in the Indian scenario.

The goal of this thesis is to improve the performance of Indian LID systems. As an initial step, we collected a speech database covering 23 Indian languages, comprising 103.5 hours of speech from 1150 speakers across multiple dialects. In this thesis, we explore various feature extraction techniques and modelling approaches for developing LID systems. Modelling the long-term temporal context is key to developing LID systems. We explore stacked shifted delta cepstral (Stacked-SDC) features, which capture this long-term temporal context to improve the performance of LID systems.
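The shifted-delta-cepstra idea mentioned above can be sketched in a few lines of numpy. This is a minimal illustration, assuming the common d-P-k parameterization (delta spacing d, shift P, k stacked deltas over 13 MFCCs); the stacked variant here simply appends the SDC vectors of neighbouring frames, which may differ in detail from the thesis's exact Stacked-SDC scheme.

```python
import numpy as np

def sdc(c, d=1, P=3, k=7):
    """Shifted delta cepstra (illustrative sketch).

    c: (T, n) array of cepstral frames (e.g., MFCCs).
    For each frame t, concatenates the k delta vectors
    delta(t + i*P) = c(t + i*P + d) - c(t + i*P - d), i = 0..k-1,
    clamping indices at the utterance edges.
    """
    T = c.shape[0]
    hi = np.minimum(np.arange(T) + d, T - 1)
    lo = np.maximum(np.arange(T) - d, 0)
    delta = c[hi] - c[lo]                              # (T, n)
    blocks = [delta[np.minimum(np.arange(T) + i * P, T - 1)]
              for i in range(k)]
    return np.concatenate(blocks, axis=1)              # (T, k*n)

def stacked_sdc(c, context=2, **kw):
    """Appends the SDC vectors of +/- `context` neighbouring frames
    (a hypothetical stacking scheme for illustration)."""
    s = sdc(c, **kw)
    T = s.shape[0]
    blocks = [s[np.clip(np.arange(T) + j, 0, T - 1)]
              for j in range(-context, context + 1)]
    return np.concatenate(blocks, axis=1)              # (T, (2*context+1)*k*n)

mfcc = np.zeros((100, 13))
feats = stacked_sdc(mfcc)   # shape (100, 455): 13 coeffs * k=7 deltas * 5 context frames
```

Each frame thus carries deltas spanning roughly k*P frames of context, which is what lets a frame-level classifier see long-term temporal structure.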
We also explore DNNs with an attention model for developing LID systems. These models have a sequence-summarizing capability, which helps to better model the long-term temporal context. A multi-head attention mechanism is explored, as multiple heads increase the capacity of the models. The neural hidden representations derived from a joint acoustic model (JAM) of a multilingual end-to-end automatic speech recognition system are also used as features for developing LID systems. These features represent contextual information learned by the acoustic model; in this work, an LSTM-CTC (Connectionist Temporal Classification) framework is used to train the JAM models. Recently, self-supervised pre-training (wav2vec 2.0) has been used to train neural networks that generate representations with discriminative properties, and we explore features from such a pre-trained network for developing LID systems. LID systems perform best when the training and operating environments match; a large degradation is observed when noise causes a mismatch in data quality. To overcome this mismatch in noisy operating environments, curriculum learning strategies are explored for training LID systems. These training methods lead to more stable LID systems that can operate in noisy environments.

Full thesis: pdf

Centre for Language Technologies Research Centre
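The sequence-summarizing role of multi-head attention described above can be sketched as attention pooling over frame-level hidden states. This is a minimal numpy sketch, not the thesis's architecture: `head_weights` stands in for the learned per-head scoring parameters, and the pooled vector would feed a softmax layer over the language labels.

```python
import numpy as np

def attention_pool(frames, head_weights):
    """Summarize a variable-length sequence into one fixed-size vector.

    frames: (T, D) frame-level hidden states from the DNN.
    head_weights: (n_heads, D) learned scoring vectors, one per head.
    Each head scores every frame, a softmax over time turns the scores
    into per-head weights, and the heads' weighted sums are concatenated
    into a single utterance-level vector of size n_heads * D.
    """
    scores = frames @ head_weights.T                   # (T, n_heads)
    scores -= scores.max(axis=0, keepdims=True)        # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=0, keepdims=True)          # softmax over frames
    pooled = alpha.T @ frames                          # (n_heads, D)
    return pooled.reshape(-1)                          # (n_heads * D,)

rng = np.random.default_rng(0)
utt = attention_pool(rng.standard_normal((200, 64)),   # 200 frames, 64-dim states
                     rng.standard_normal((4, 64)))     # 4 heads
# utt.shape == (256,) -> classified into one of the 23 languages
```

With a single head this reduces to plain attention pooling; extra heads let different heads attend to different language-discriminative regions of the utterance, which is the capacity gain the abstract refers to.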
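One common curriculum learning strategy for the noisy-environment problem above is easy-to-hard scheduling: train first on clean (high-SNR) utterances and progressively admit noisier ones. The sketch below illustrates that generic schedule only; the thesis's actual curriculum strategies may differ.

```python
def curriculum_batches(utterances, snrs, n_stages=3):
    """Yield progressively harder training subsets (illustrative sketch).

    utterances: list of training examples.
    snrs: signal-to-noise ratio (dB) per utterance; higher = easier.
    Stage 0 trains on the cleanest 1/n_stages of the data, each later
    stage adds the next-noisiest slice, and the final stage uses all data.
    """
    order = sorted(range(len(utterances)), key=lambda i: -snrs[i])
    for stage in range(1, n_stages + 1):
        cutoff = round(len(order) * stage / n_stages)
        yield [utterances[i] for i in order[:cutoff]]

utts = ["u1", "u2", "u3", "u4", "u5", "u6"]
snrs = [20, 5, 30, 0, 10, 15]
stages = list(curriculum_batches(utts, snrs))
# stage sizes grow: 2, 4, 6 utterances
```

The intuition is that early updates on clean speech give the model stable language-discriminative representations before the harder, noise-corrupted examples are introduced.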
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.